Europäisches Patentamt
European Patent Office
Office européen des brevets




Certificate

The attached documents are exact copies of the European patent application described on the following page, as originally filed.



Patent application No. 03077522.5



For the President of the European Patent Office
p.o.

R C van Dijk

PRIORITY 
DOCUMENT 

SUBMITTED OR TRANSMITTED IN 
COMPLIANCE WITH RULE 17.1(a) OR (b) 







Application no.: 03077522.5

Date of filing: 08.08.03



Applicant(s):

Koninklijke Philips Electronics N.V.
Groenewoudseweg 1
5621 BA Eindhoven
Netherlands



Title of the invention:
(If no title is shown please refer to the description.)

System for browsing a collection of information units

Priority(ies) claimed

State/Date/File no.:



International Patent Classification:

G06F1/00



Contracting states designated at date of filing:

AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL
PT RO SE SI SK TR LI




PHNL031006BPQ 






System for browsing a collection of information units 



FIELD OF THE INVENTION 

The invention relates to a method for content recommendation.

The invention further relates to a system and computer program product for implementing the above method.

BACKGROUND OF THE INVENTION 

Collaborative filtering is a method for content recommendation that combines the interests of a large group of users. Typically, the information gathering is done on a server (portal).

Prior art

Collaborative filtering. Memory-based collaborative filtering techniques are based on determining correlations (similarities) between different users, for which the ratings of each user are compared to the ratings of each other user. Typical similarity measures that are used are the Pearson correlation and the kappa statistic, or variants thereof. Next, these similarities are used to predict how much a particular user will like a particular piece of content. For the prediction step, too, several alternatives exist that differ slightly from each other. Apart from determining similarities between users, one may determine similarities between items, based on the rating patterns they received from the users. For this dual approach, one can use similar similarity measures and prediction functions as in the above. A problem in this context is the protection of the privacy of the users, who don't want to reveal their interests to a server or to other users.

Methods exist for the following two problems:

1. Given two parties that each have a secret vector of integers, determine the inner product between the vectors without any of the parties having to reveal the specific information.

2. Given a set of parties that each have a secret number, determine the sum of the numbers without any of the parties having to reveal the number.

The former can be done using e.g. the Paillier cryptosystem. The latter problem can be handled by using a key-sharing scheme (also Paillier), where decryption can [...]

OBJECT AND SUMMARY OF THE INVENTION
It is an object of the invention to provide an improved system and method of the type defined in the opening paragraph. To that end, the method according to the invention comprises a step of collecting at a central server encrypted rating vectors from at least two users, a step of collaborative filtering using the encrypted rating vectors so as to protect the users' privacy, and a step of sending a content recommendation to a user.

The invention protects the users' privacy, given by their rating information, by rewriting the computational steps required for the collaborative filtering algorithm into vector inner products and sums of shares, after which the mentioned encryption techniques are applied to protect them. In a sense, this means that only encrypted information is sent to the central server, and all computations are done in the encrypted domain.

The key benefit of the invention is that user information is protected. The invention can be used in various kinds of recommendation services, such as music or TV show recommendation, but also medical or financial recommendation applications. In the latter cases, privacy protection may be even more important than in the former ones.

Suppose we want to predict the score of an item i for active user a.

1. First, we compute the correlation between user a and every other user x. This is done by computing inner products between the rating vector of user a and that of each other user x, through an exchange via the server. In this way, user a knows the correlation value with each other user x = 1, 2, ..., n, but he does not know who users 1, 2, ..., n are. On the other hand, the server knows who users 1, 2, ..., n are, but it doesn't know the correlation values.

2. Next, we compute a prediction for item i for user a by taking a kind of weighted average of the ratings of users 1, 2, ..., n for this item, where the weights are given by the correlation values. The procedure for this is that user a encrypts the correlation values and sends them to the server, which forwards them to the respective users 1, 2, ..., n. Each user x = 1, 2, ..., n multiplies the encrypted correlation value he receives with the rating he gave for item i, and sends the result back to the server. The server, still not able to decrypt anything at all, then combines the encrypted products of the users 1, 2, ..., n into an encrypted sum, and sends this end result back to user a, who can decrypt it to get the desired result.
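The two encrypted-domain operations used in these steps, an inner product and a sum, can be illustrated with a toy additively homomorphic scheme. The sketch below is a minimal Paillier-style example and not the claimed implementation: the tiny primes, the generator g = n + 1, the function names and the use of small non-negative integers for ratings and weights are all assumptions made for illustration, and the parameters are far too small to be secure. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier keys from two small primes (illustration only, insecure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # inverse of lam mod n; valid for generator g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """E(m) = (1 + n)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover m (mod n) from a ciphertext using the private key (lam, mu)."""
    lam, mu = priv
    return (((pow(c, lam, n * n) - 1) // n) * mu) % n

def encrypted_inner_product(n, enc_a, b):
    """Given E(a_i) and plaintext b_i, return E(sum_i a_i * b_i)."""
    n2 = n * n
    acc = 1  # 1 is a valid encryption of 0
    for c, b_i in zip(enc_a, b):
        acc = (acc * pow(c, b_i, n2)) % n2  # E(a_i)^b_i = E(a_i * b_i)
    return acc

def encrypted_sum(n, ciphertexts):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    n2 = n * n
    acc = 1
    for c in ciphertexts:
        acc = (acc * c) % n2
    return acc

# Step 1: user a sends encrypted ratings; user x combines them with his own.
n, priv = keygen()
ratings_a = [5, 4, 1, 2, 2]
ratings_x = [2, 1, 4, 4, 4]
enc_a = [encrypt(n, v) for v in ratings_a]
inner = decrypt(n, priv, encrypted_inner_product(n, enc_a, ratings_x))
```

Only the holder of the private key learns `inner`; the party that combines the ciphertexts sees no plaintext. Negative correlation values would need an encoding into the range 0..n-1, which this sketch leaves out.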
Although the invention will be described with reference to particular illustrative embodiments, variants and modifications are possible within the scope of the inventive concept.

The use of the verb 'to comprise' and its conjugations does not exclude the presence of elements or steps other than those defined in a claim. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware.

A 'computer program' is to be understood to mean any software product stored on a computer-readable medium, such as a floppy-disk, downloadable via a network, such as the Internet, or marketable in any other manner.
Chapter 1

Introduction

The explosive growth of the world wide web has led to a situation where people are confronted with an overload of information. In order to relieve the problem of searching through all the information for interesting items, a wide range of systems is being developed. These systems can be divided into three groups: information retrieval systems, information filtering systems and recommendation systems.

Information retrieval systems allow users to express queries to select documents that match a topic.

Information filtering systems use the same techniques as information retrieval systems, but are meant for a stream of incoming documents. In this case the user has a long-term interest in certain topics.

Recommendation systems try to make a choice for a user based on his likes and dislikes.

An advantage of recommendation systems is that they can surprise the user with new, unknown items that he presumably likes. There are two kinds of recommendation systems:

Content-based systems use the content of the items and the user's preferences in the past to make a choice for the user.

Collaborative filtering systems recommend items to a user based on preferences of other users.

Collaborative filtering has some advantages over the other method. Subjective attributes such as quality, clarity and presentation style can be taken into account, since the system uses knowledge of other people who have accessed the item rather than properties of the content. As analysis of the content of an item is not necessary, the system can handle any type of data. However, collaborative filtering has some disadvantages too.

• Users must give their preferences about a lot of items before the system will work.

• New items can only be recommended after some users have evaluated them.

• Popular items, such as songs from The Beatles, are more often recommended.

A collaborative filtering system, called Jukebox, has been developed at Philips Research. This system is described in Section 1.1. The Jukebox system still has some flaws, which are described in Section 1.2. These flaws lead us to requirements for a new system, which are given in the problem statement (Section 1.3). We conclude this chapter with an outline of the report in Section 1.4.
1.1 Jukebox system

The Jukebox recommender [14, 15] is a collaborative filtering system that recommends songs to users. The users have a music player (jukebox) at home, with which they can listen to certain songs. When the users have listened to a song, they can describe their taste by rating this song via the interface of the jukebox. The songs are rated on a scale from 1 to 5. A rating (or vote) of 1 indicates that the user dislikes a song while a rating of 5 means that the user likes the song very much. The jukeboxes of the different users are all connected to a computer (server) which calculates the recommendations. For this purpose the server needs the ratings of the users. When a user makes a rating, the jukebox sends this rating via the connection to the server. The server stores the ratings in a large database (Figure 1.1).

Figure 1.1: A schematic view of the Jukebox recommender system.

Computing similarities between users. We first introduce the terms profile and active user. A profile is the list with the songs the user rated combined with the ratings the user gave to the songs. The active user is the user for whom we want to make recommendations. If the active user wants a new song to listen to, the system searches in the database with all profiles for those user profiles that match the profile of the active user. In the Jukebox system the kappa statistic [...] is used as a measure for matching (or similarity) between two users. The kappa statistic takes values between 0 and 1, where a 0 indicates that there is no matching at all, while a 1 indicates a perfect match. The kappa statistic is described in more detail in Section 2.2.1.

Generation of a recommendation list. To generate a recommendation for an active user, the following steps are performed.

1. All kappa coefficients for the active user are computed.

2. The coefficients that are lower than a certain threshold are ignored. These coefficients correspond to users with a taste different from the active user.

3. Select the user that corresponds to the largest coefficient. The profile of this user is the most similar to the profile of the active user.

4. Add all songs with a top score (4 or 5) that the active user did not vote on to the list of recommended songs. Increment the number of votes by one for each song added to the list. Proceed with the user with the next best match.

5. Recommend the songs with the highest number of votes first. Most of the users similar to the active user like these songs.

The similarities between the users are not calculated after every update, because this is a lot of work. In the Jukebox system it was decided to calculate the similarities once a week.
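The five steps for generating a recommendation list can be sketched as follows. This is a minimal illustration under stated assumptions and not the Jukebox implementation: the function name `recommend`, the dictionary-based profile layout, the `similarity` callback (standing in for the kappa statistic) and the default threshold are all hypothetical.

```python
from collections import Counter

def recommend(active, profiles, similarity, threshold=0.0, top_scores=(4, 5)):
    """Rank unrated songs for `active` by the number of similar users who gave a top score.

    profiles: dict mapping user -> {song: rating}
    similarity: callable (user, user) -> float, assumed to be precomputed
    """
    rated_by_active = set(profiles[active])
    # Steps 1 and 2: compute coefficients, ignore those below the threshold.
    neighbours = [(similarity(active, u), u) for u in profiles if u != active]
    neighbours = [(s, u) for s, u in neighbours if s >= threshold]
    # Step 3: visit users from the best match to the worst.
    neighbours.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for _, user in neighbours:
        # Step 4: count top-scored songs the active user has not voted on.
        for song, rating in profiles[user].items():
            if rating in top_scores and song not in rated_by_active:
                votes[song] += 1
    # Step 5: songs with the highest number of votes come first.
    return [song for song, _ in votes.most_common()]

# Hypothetical data for illustration.
profiles = {
    "a": {"s1": 5},
    "b": {"s1": 5, "s2": 4, "s3": 5},
    "c": {"s1": 4, "s2": 5, "s4": 2},
    "d": {"s5": 5},
}
sims = {("a", "b"): 0.9, ("a", "c"): 0.5, ("a", "d"): -0.2}
ranking = recommend("a", profiles, lambda u, v: sims[(u, v)], threshold=0.1)
```

In this sketch the order in which neighbours are visited only affects how ties in the vote count are broken.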
1.2 Issues in collaborative filtering systems

The recommendations made with a standard collaborative filtering system (such as the Jukebox system) are quite good, but the system still has some flaws.

• The server stores the ratings of the users. This is very personal information that should be protected against unwanted use. The server extracts information from the data the users send, such as relationships between certain songs. This information is used by the server to offer valuable services. For that reason, information valuable to the server should be protected too.

• The Jukebox algorithm requires a computation that grows with the number of users and songs. In the next chapter we shall see that the time complexity for the Jukebox algorithm is quadratic in the number of users. This means that if the number of users doubles, the computation time will be multiplied by four.

• The correlation measure (or kappa statistic) calculates the similarities based on songs the users both rated (Section 2.2.1). It is difficult to find users who rated the same songs as the active user, as there are a lot of songs and even a user who listens very often cannot listen to all of them. This means that the correlation measure or the kappa statistic is very unreliable, as it is based on only a few songs. A lot of users are therefore not able to receive recommendations.

1.3 Problem statement 

We assume that a system that is yet to be built has one server available, which is connected with the music players of the users. The music player could be the internet radio player Streamium, which is developed at Philips Research. Of course, there exist a lot of users with such a player, therefore we demand that the system can cope with a lot of people, for instance 10,000. For the security of the user information and the server information, we derived the following requirements.

• A user may not know the rating of any other (anonymous) user for a given song.

• A user may not know which songs any other (anonymous) user rated.

• A user may not know any data valuable to the server, such as dependencies between songs.

• The server may not know the rating of any user for a given song.

• The server may not know which songs any user rated.

• The server may not know which user resembles any other user.

• The server determines how many songs are recommended. If the user got all the recommendations at once, he would use the recommender only one time, and the server would not make any profit.

Finally, the quality of prediction should be at least as good as the quality of prediction of the Jukebox system.

1.4 Outline of the report

In the next chapter an overview of collaborative filtering algorithms from the literature is given. Chapter 3 introduces the cryptographic techniques we use in order to secure the user profiles and the data valuable to the server. In Chapters 4, 5 and 6 we derive security protocols for the algorithms given in Chapter 2. We implemented the factor-analysis algorithm and tested it with the Philips EasyAccess database. This database is the result of an earlier experiment with the Jukebox system. The results of the test with the factor-analysis algorithm can be found in Chapters 7 and 8.
Chapter 2

Collaborative Filtering Algorithms

Collaborative filtering algorithms can be divided into two categories: memory-based algorithms and model-based algorithms.

Memory-based algorithms use the database with votes to calculate 'distances' between users. [...] Memory-based algorithms fall into two categories, being user-based and item-based algorithms. The user-based algorithms are described in Section 2.2, while the item-based algorithms are described in Section 2.3.

Model-based algorithms use the database with votes to build a model, which is then used for calculating predictions. An example is the factor-analysis algorithm, which is described in Section 2.4.

[...] an example which is used throughout the text.



2.1 Notation 



[...] to sell new flavours to their customers. The company sells [...]

Table 2.1: Customer opinions of tea and coffee flavours.

[...] was considered as a strong dislike for the flavour, while a 0 meant that the user [...]. [...] (in this case flavours) with the corresponding votes from a certain user is called the profile of the user. [...] would probably like its product. A collaborative filtering algorithm [...] if the prediction for the item is high enough. In the next sections we discuss how predictions can be made, for which we will use the notation as given in Figure 2.1.
Symbol          Description

$U$             number of users
$I$             number of items
$x, y$          users
$a$             active user
$i, j$          items
$v_{xi}$        vote from user $x$ for item $i$
$\bar v_x$      mean vote of user $x$ over his rated items
$v_x$           vector with votes from user $x$, where a 0 indicates that an item had not been rated
$r_{xi}$        is 1 if user $x$ rated item $i$ and 0 otherwise
$r_x$           vector with elements $r_{xi}$
$I_x$           set with items user $x$ rated
$U_i$           set of users who rated item $i$
$s(x, y)$       similarity
$p_{ai}$        prediction for user $a$ and item $i$

Figure 2.1: Notation.

2.2 User-based algorithms 

User-based algorithms are probably the oldest and most widely used collaborative filtering algorithms [...]. An example of a user-based algorithm is the one used in the Jukebox system. First a similarity measure is calculated between every pair of users, indicating to what extent their profiles match. Next, a prediction for item $i$ is calculated by taking a weighted average over the users who gave their preference about item $i$, where the similarities between users define the weights. If users have a high similarity with the active user, their influence in the predicted vote is bigger. There are a lot of possibilities for the choice of the similarity measure and the way in which we make predictions, as we show in Sections 2.2.1 and 2.2.2. For the similarity measure we can for instance choose distance metrics, metrics based on counting the number of items both users like, or correlation metrics. The one that is most often used is

$$ s(x, y) = \frac{\sum_{i \in I_x \cap I_y} (v_{xi} - \bar v_x)(v_{yi} - \bar v_y)}{\sqrt{\sum_{i \in I_x \cap I_y} (v_{xi} - \bar v_x)^2 \, \sum_{i \in I_x \cap I_y} (v_{yi} - \bar v_y)^2}} \tag{2.1} $$

called the Pearson correlation, and the corresponding prediction for user $a$ and item $i$ is given by

$$ p_{ai} = \bar v_a + \frac{\sum_{x \in U_i} s(a, x)(v_{xi} - \bar v_x)}{\sum_{x \in U_i} |s(a, x)|} \tag{2.2} $$

Not only users with a high correlation influence a prediction for the active user in (2.2). Users with a very negative correlation have a taste opposite to the active user, so if a user with a negative correlation likes an item, the active user will probably dislike it.
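As a rough illustration of the Pearson correlation and the weighted-average prediction, the sketch below computes both on a tiny hypothetical data set. It assumes that the user means are taken over all rated items, that the sums in the correlation run over the items both users rated, and that the prediction uses the common mean-offset form with absolute similarities in the denominator. The function names and the data are illustrative only.

```python
import math

def mean_vote(profile):
    """Mean vote of a user over all items he rated."""
    return sum(profile.values()) / len(profile)

def pearson(px, py):
    """Pearson correlation over the items rated by both users."""
    common = px.keys() & py.keys()
    mx, my = mean_vote(px), mean_vote(py)
    num = sum((px[i] - mx) * (py[i] - my) for i in common)
    den = math.sqrt(sum((px[i] - mx) ** 2 for i in common)
                    * sum((py[i] - my) ** 2 for i in common))
    return num / den if den else 0.0  # undefined cases are mapped to 0

def predict(active, item, profiles):
    """Mean-offset weighted average over the users who rated `item`."""
    pa = profiles[active]
    num = den = 0.0
    for user, px in profiles.items():
        if user == active or item not in px:
            continue
        s = pearson(pa, px)
        num += s * (px[item] - mean_vote(px))
        den += abs(s)
    base = mean_vote(pa)
    return base + num / den if den else base

# Hypothetical profiles: user -> {item: vote on a 1..5 scale}.
profiles = {
    "a": {"i1": 5, "i2": 1},
    "b": {"i1": 4, "i2": 2, "i3": 5},
    "c": {"i1": 1, "i2": 5, "i3": 1},
}
prediction = predict("a", "i3", profiles)
```

Here user c correlates negatively with user a, so c's low vote for item i3 pushes the prediction up, which is the effect described in the text.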

Example 2.1: Suppose Aukje wants to know her expected preference for item T3. We start by calculating the similarities between Aukje and the other users with the Pearson correlation (2.1). The mean of Aukje is $\bar v_{\mathrm{Aukje}}$ = [...]. The mean of user Jan can be calculated in the same way. The set $I_{\mathrm{Aukje}} \cap I_{\mathrm{Jan}}$ = {T1, C1, C2, C3, C8}. If we fill in the proper values, we obtain

$$ s(\mathrm{Aukje}, \mathrm{Jan}) = \frac{2.2 \cdot (-2.3) - 1.8 \cdot 1.7 + 0.3 \cdot 1.7 - 0.8 \cdot 0.7 - 0.8 \cdot (-0.3)}{\sqrt{2.2^2 + 1.8^2 + 0.3^2 + 0.8^2 + 0.8^2} \, \sqrt{2.3^2 + 1.7^2 + 1.7^2 + 0.7^2 + 0.3^2}} $$

where the denominator is approximately $3.1 \cdot 3.4$.
A similar calculation can be performed for the other pairs of users, resulting in similarity values as shown in Table 2.2. Note that the correlation is symmetric, i.e. the correlation between Jan and Aukje is the same as the correlation between Aukje and Jan.

          Wim     Jan     Arnout  Aukje
Wim
Jan       0.78
Arnout    0.96    -0.74
Aukje     -0.85   -0.77   0.85

Table 2.2: The Pearson correlation between the users in the example of Table 2.1.

If we use predictor (2.2) then the prediction for Aukje and T3 is

$$ p_{\mathrm{Aukje},\mathrm{T3}} = \bar v_{\mathrm{Aukje}} + \frac{[\ldots]}{0.85 + 0.77 + 0.85} $$

This means that Aukje will probably like tea T3.

If we want predictions for all missing votes, then we have to calculate $O(U^2)$ similarities. Every similarity consists of three sums over the items. Therefore the similarity-calculation phase will cost $O(U^2 I)$ time. The prediction phase has the same time complexity: every user wants $O(I)$ predictions and every prediction consists of a sum over the users. The total algorithm thus has a time complexity $O(U^2 I)$.

2.2.1 Similarity measure for the user-based algorithm 

We mentioned earlier that there are a lot of similarity measures. The similarity can be a distance between two profiles, the correlation or a measure of the number of equal votes between two profiles. For the calculation of the predictions, it is necessary that the similarities are high if the users have the same taste, and low if they have an opposite taste.

Distance measures

The distance calculates the total difference in votes between the users. The distance is zero if the users have exactly the same taste. The distance is high if the users behave totally opposite. Therefore we have to do an adjustment such that the weights are high if the users vote the same. A simple distance measure is the Manhattan distance [1, 11], which is given by

$$ d(x, y) = \frac{\sum_{i \in I_x \cap I_y} |v_{xi} - v_{yi}|}{|I_x \cap I_y|} $$

Another distance measure is the mean-square difference [25], which is based on the squared difference between the votes. The mean-square difference is given by

$$ d(x, y) = \frac{\sum_{i \in I_x \cap I_y} (v_{xi} - v_{yi})^2}{|I_x \cap I_y|} $$

Example 2.2: The Manhattan distance between Wim and Aukje is

$$ (|2-5| + |1-4| + |4-1| + |4-2| + |4-2|)/5 = 2.6. $$

The Manhattan distance between Arnout and Aukje is 0.8. The maximum possible distance is 5 - 1 = 4 while the minimum distance is 0. To make proper weights, we adjust this distance by subtracting the real distance from the mean distance 2. The weight between Wim and Aukje is hence 2 - 2.6 = -0.6 while the weight between Arnout and Aukje is 2 - 0.8 = 1.2. We can calculate the mean-square difference between Wim and Aukje in the same way. This distance is equal to 7. The mean-square difference between Arnout and Aukje is still 0.8. The mean-square difference takes values between 0 and $(5 - 1)^2 = 16$. To make proper weights, we subtract the distance from $2^2 = 4$, where 2 is the mean difference between the votes. If we took 8, the weight between Wim and Aukje would be positive, indicating that they are similar. The weight between Wim and Aukje is 4 - 7 = -3 while the weight between Arnout and Aukje is 4 - 0.8 = 3.2. The mean-square difference discriminates better between the users with the same taste and users with an opposite taste. However, users with an opposite taste receive high negative weights, as the weights are between -12 and 4.
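The arithmetic of the distance example can be checked with a few lines of code. The sketch below uses the five co-rated votes of Wim and Aukje as they appear in the example (2, 1, 4, 4, 4 against 5, 4, 1, 2, 2); the function names and the choice to average over the co-rated items are assumptions for illustration.

```python
def manhattan(x, y):
    """Mean absolute difference between the votes on the co-rated items."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def mean_square_difference(x, y):
    """Mean squared difference between the votes on the co-rated items."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

wim = [2, 1, 4, 4, 4]
aukje = [5, 4, 1, 2, 2]

# Turn distances into weights by subtracting them from the mean distance (2)
# or from the squared mean difference (4).
manhattan_weight = 2 - manhattan(wim, aukje)
msd_weight = 4 - mean_square_difference(wim, aukje)
```

With these votes the Manhattan distance is 13/5 and the mean-square difference is 35/5, giving the weights of about -0.6 and -3.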

Correlation measures 

We have already seen the Pearson correlation (2.1). We mentioned that a high correlation (close to 1) indicates that the users have similar profiles, while a low correlation (close to -1) indicates that users have an opposite taste. However, as the correlation is actually a measure of the linear relationship between two users, this is not always true. For instance, if user $x$ votes 1 for all items and user $y$ votes 5 for all items, then the correlation between the users $x$ and $y$ is 1. When we use the Pearson correlation as similarity measure, we make the assumption that the means of the users are approximately the same. The Pearson correlation has a variant called the constrained Pearson correlation [14], which does not have this problem, and which is given by

$$ s(x, y) = \frac{\sum_{i \in I_x \cap I_y} (v_{xi} - c)(v_{yi} - c)}{\sqrt{\sum_{i \in I_x \cap I_y} (v_{xi} - c)^2 \, \sum_{i \in I_x \cap I_y} (v_{yi} - c)^2}} $$

where $c$ is a constant. The only difference between the Pearson correlation and the constrained Pearson correlation is that instead of the mean of the users a value $c$ is substituted. The constrained Pearson correlation can be used to avoid the computation of the mean of the user. As our rating scale is 1 to 5, we assume that the mean vote is 3. If we use $c = 0$, the constrained Pearson correlation is called vector similarity or cosine [...], as the angle between the vectors of users $x$ and $y$ is measured. By choosing a low value for $c$ we can indicate that similarity between high scored songs is more important than similarity between low scored songs.

Example 2.3: If we use the constrained Pearson correlation with $c = 2$, then the correlation between Aukje and Wim is

$$ \frac{3 \cdot 0 + 2 \cdot (-1) - 1 \cdot 2 + 0 \cdot 2 + 0 \cdot 2}{\sqrt{(9 + 4 + 1) \cdot (1 + 4 + 4 + 4)}} \approx -0.30 $$

The constrained Pearson correlation between Aukje and Arnout is 0.86. The correlation between Arnout and Aukje is stronger compared to the Pearson correlation, while the correlation between Wim and Aukje is less strong.
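A short sketch of the constrained Pearson correlation makes the role of the constant c concrete. It uses the co-rated votes of Aukje (5, 4, 1, 2, 2) and Wim (2, 1, 4, 4, 4) from the running example; the function name and the handling of a zero denominator are assumptions.

```python
import math

def constrained_pearson(x, y, c):
    """Correlation of two vote lists around a fixed constant c instead of the user means."""
    num = sum((a - c) * (b - c) for a, b in zip(x, y))
    den = math.sqrt(sum((a - c) ** 2 for a in x) * sum((b - c) ** 2 for b in y))
    return num / den if den else 0.0  # undefined cases are mapped to 0

aukje = [5, 4, 1, 2, 2]
wim = [2, 1, 4, 4, 4]

s_c2 = constrained_pearson(aukje, wim, c=2)      # numerator -4, denominator sqrt(14 * 13)
s_cosine = constrained_pearson(aukje, wim, c=0)  # c = 0 gives the cosine of the vote vectors
```

With c = 2 the votes equal to 2 contribute nothing, which is why only the high votes dominate the result.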

Counting measures

These measures are based on the number of times that two users rated an item similarly. A simple counting measure is the majority voting measure [16], given by

$$ s(x, y) = (2 - \gamma)^{c_{xy}} \, \gamma^{w_{xy}} \tag{2.6} $$

where $0 < \gamma < 1$, and $c_{xy}$ is the number of items that the users rated the same, while $w_{xy}$ is the number of items the users rated differently. More precisely, define a set $C(v)$ as the set of votes that are considered equal to a value $v$, then $c_{xy} = |\{i \in I_x \cap I_y : v_{xi} \in C(v_{yi})\}|$ and $w_{xy} = |\{i \in I_x \cap I_y : v_{xi} \notin C(v_{yi})\}|$. Usually, the set is symmetric, i.e. $a \in C(b) \Leftrightarrow b \in C(a)$. In that case we have $c_{xy} = c_{yx}$ and $w_{xy} = w_{yx}$, so the similarity is symmetric too. It can be shown that, if you give the algorithm based on Pearson correlation and the algorithm based on majority voting a case where they make the most mistakes, then majority voting makes fewer mistakes [16]. Here, the algorithm is said to make a mistake if the rounded value of the prediction does not equal the actual vote.
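The majority voting measure can be sketched in a few lines. The grouping of votes into classes ({1, 2}, {3}, {4, 5}) is one possible symmetric choice of the sets C(v), and the second vote list is hypothetical; both are assumptions for illustration, as is the function name.

```python
# Votes in the same class are considered equal: {1, 2}, {3}, {4, 5}.
VOTE_CLASS = {1: "low", 2: "low", 3: "mid", 4: "high", 5: "high"}

def majority_voting_similarity(x, y, gamma=0.5):
    """(2 - gamma)^same * gamma^different over the co-rated items."""
    same = sum(1 for a, b in zip(x, y) if VOTE_CLASS[a] == VOTE_CLASS[b])
    different = len(x) - same
    return (2 - gamma) ** same * gamma ** different

user_x = [5, 4, 1, 2, 2]
user_y = [4, 5, 1, 3, 1]   # hypothetical votes: four agreements, one disagreement
opposite = [2, 1, 4, 4, 4]  # disagrees with user_x on every item

s_close = majority_voting_similarity(user_x, user_y)
s_far = majority_voting_similarity(user_x, opposite)
```

Each agreement multiplies the similarity by 2 - gamma and each disagreement by gamma, so the measure is always positive and shrinks quickly for users with opposite taste.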

In the Jukebox system the kappa statistic [...] was tested as a measure for matching between two users. It was found that it is better to not only look for identical votes, but also to consider nearly-identical votes, although with a lower weight. Therefore a variant was used, called the weighted kappa [14, 15], which is given by

$$ \kappa_{xy} = \frac{o_{xy} - e_{xy}}{1 - e_{xy}} $$

where

$$ o_{xy} = \sum_{v=1}^{5} \sum_{w=1}^{5} w_{vw} \, p_{xy}(v, w) $$

is the observed proportion of agreement and

$$ e_{xy} = \sum_{v=1}^{5} \sum_{w=1}^{5} w_{vw} \, q_{xy}(v, w) \tag{2.9} $$

is the degree of agreement solely on basis of chance. The probability $p_{xy}(v, w)$ that user $x$ voted $v$ and user $y$ voted $w$ is given by

$$ p_{xy}(v, w) = \frac{|\{i \in I_x \cap I_y : v_{xi} = v \wedge v_{yi} = w\}|}{|I_x \cap I_y|} \tag{2.10} $$

If votes were made purely randomly, the estimated expected probability that user $x$ voted $v$ and user $y$ voted $w$ would be $q_{xy}(v, w)$, which is given by

$$ q_{xy}(v, w) = \sum_{w'=1}^{5} p_{xy}(v, w') \cdot \sum_{v'=1}^{5} p_{xy}(v', w) \tag{2.11} $$

The weights $w_{vw}$ are chosen such that $0 \le w_{vw} \le 1$ and $w_{vw} = w_{wv}$. These weights reflect the extent to which two votes $v$ and $w$ are considered equal. A good weight matrix is depicted in Table 2.3. Weighted kappa varies from negative values to a value of one, indicating perfect agreement.
        1     2     3     4     5
1       1     1/2   0     0     0
2       1/2   1     1/2   0     0
3       0     1/2   1     1/2   0
4       0     0     1/2   1     1/2
5       0     0     0     1/2   1

Table 2.3: A weight matrix for the weighted kappa statistic.



Example 2.4: Here we describe the calculation of the weighted kappa for Aukje and Arnout. First we calculate the fractions $p$ as depicted in Table 2.4. The fractions $q$ are obtained by multiplying the row and column sums of $p$; for example, one of them equals 2/5 · 1/5 = 2/25. If we use the weight matrix of Table 2.3, then the kappa statistic between Aukje and Arnout is about

$$ \frac{0.6 - 0.42}{1 - 0.42} \approx 0.31 $$

The kappa statistic between Aukje and Wim is about -0.5.

Table 2.4: The fractions $p$ between Aukje and Arnout for the example of Table 2.1.
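The weighted kappa computation (fractions p, chance fractions q from row and column sums, then observed and chance agreement) can be sketched as follows. The weight function, which gives 1 to identical votes, 1/2 to votes that differ by one and 0 otherwise, is an assumption, as are the function names and the second vote list, which is hypothetical data rather than a profile from the example.

```python
from collections import Counter
from fractions import Fraction

def weight(v, w):
    """Assumed weights: 1 for identical votes, 1/2 for adjacent votes, 0 otherwise."""
    diff = abs(v - w)
    if diff == 0:
        return Fraction(1)
    return Fraction(1, 2) if diff == 1 else Fraction(0)

def weighted_kappa(x_votes, y_votes, scale=range(1, 6)):
    """Weighted kappa over co-rated items; returns None when chance agreement is 1."""
    n = len(x_votes)
    counts = Counter(zip(x_votes, y_votes))
    p = {(v, w): Fraction(counts[(v, w)], n) for v in scale for w in scale}
    row = {v: sum(p[(v, w)] for w in scale) for v in scale}  # how often x voted v
    col = {w: sum(p[(v, w)] for v in scale) for w in scale}  # how often y voted w
    observed = sum(weight(v, w) * p[(v, w)] for v in scale for w in scale)
    chance = sum(weight(v, w) * row[v] * col[w] for v in scale for w in scale)
    if chance == 1:
        return None
    return (observed - chance) / (1 - chance)

user_x = [5, 4, 1, 2, 2]
user_y = [4, 4, 2, 3, 2]  # hypothetical votes for illustration
kappa = weighted_kappa(user_x, user_y)
```

Exact fractions are used so that the intermediate values can be compared with a hand calculation: here the observed agreement is 7/10 and the chance agreement is 19/50.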



2.2.2 Predictions for the user-based algorithm

In this section we describe three predictors and their variants. We assume that the similarities $s(x, y)$ are chosen such that they are high if the users are similar and negative if they have opposite tastes. We start with the standard predictor given by

$$ p_{ai} = \bar v_a + \frac{\sum_{x \in U_i} s(a, x)(v_{xi} - \bar v_x)}{\sum_{x \in U_i} |s(a, x)|} $$

In Example 2.1 we summed over all the other users. We also could have summed over the users similar to the active user. Often the users are selected according to a certain threshold, and the users with a high similarity (above the threshold) are used to make predictions. A simpler predictor is the following

$$ p_{ai} = \frac{\sum_{x \in U_i} s(a, x) \, v_{xi}}{\sum_{x \in U_i} |s(a, x)|} \tag{2.13} $$

where the sum is over all users, or over the users with a similarity above a certain threshold. We already mentioned the majority voting similarity measure. The predictor that is used in combination with this similarity is the majority voting predictor, which is given by

$$ p_{ai} = \begin{cases} \arg\max_{v \in \{1, \ldots, 5\}} \sum_{x : v_{xi} \in C(v)} s(a, x) & \text{if there are values } v_{xi}, \\ c & \text{else,} \end{cases} \tag{2.14} $$

where $c$ is a constant. Note that the majority voting similarity is always positive. The predictor can be adapted in order to deal with negative similarities. To make a prediction, we simply try all the possible votes; the vote with the maximum total similarity is the prediction. With the majority voting predictor, an item can be recommended not only when similar users liked the item, but also when a lot of users liked the item.

Example 2.5: Suppose we define the sets $C(1) = C(2) = \{1, 2\}$, $C(3) = \{3\}$ and $C(4) = C(5) = \{4, 5\}$. Choose the parameter $\gamma = 0.5$. Now the similarity (2.6) between Aukje and Arnout is equal to

$$ s(\mathrm{Aukje}, \mathrm{Arnout}) = 1.5^4 \cdot 0.5^1 \approx 2.53. $$

The same can be done for the other pairs of users, resulting in similarity values as shown in Table 2.5. Again we want to calculate a prediction for Aukje and tea T3. The votes 1 and 2 have a weight of about 0.03 + 0.03 = 0.06. Vote 3 has weight 0. Votes 4 and 5 have weight 2.53. The latter votes have the maximum weight, and the prediction is therefore that Aukje likes the tea. As the votes 4 and 5 receive the same weight, both votes are equally likely.
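The majority voting predictor amounts to trying every possible vote and keeping the one with the largest total similarity. The sketch below assumes the vote classes {1, 2}, {3}, {4, 5} and uses the similarity values 0.03, 0.03 and 2.53 from the example; the concrete votes of the three neighbours are hypothetical, as are the function name and the default constant.

```python
# C[v] is the set of votes considered equal to v.
C = {1: {1, 2}, 2: {1, 2}, 3: {3}, 4: {4, 5}, 5: {4, 5}}

def majority_voting_prediction(neighbours, default=3):
    """neighbours: list of (similarity, vote) pairs of the users who rated the item.

    Returns the vote with the largest total similarity, or `default`
    when nobody rated the item. Ties go to the lowest tied vote.
    """
    if not neighbours:
        return default
    totals = {v: sum(s for s, vote in neighbours if vote in C[v]) for v in range(1, 6)}
    return max(totals, key=totals.get)

# Two dissimilar users disliked the item, one similar user liked it.
neighbours = [(0.03, 1), (0.03, 2), (2.53, 5)]
prediction = majority_voting_prediction(neighbours)
```

Because votes 4 and 5 belong to the same class they receive the same total weight, so the sketch has to break the tie arbitrarily.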

          Wim     Jan     Arnout  Aukje
Wim
Jan       1.27
Arnout    0.02    0.02
Aukje     0.03    0.03    2.53

Table 2.5: Majority voting similarity for the example of Table 2.1.

2.2.3 Online user-based algorithms

The user-based algorithms we described so far need a total recalculation of the similarities if they become obsolete. Online algorithms adapt the similarities directly as a vote arrives. The similarity for majority voting can be easily translated into an online algorithm. At the beginning of the algorithm the similarity $s(x, y) = 1$ for each pair of users $x, y$, and we choose a parameter $0 < \gamma < 1$. When a new vote $v_{xi}$ arrives, we set

$$ s(x, y) := \begin{cases} (2 - \gamma) \, s(x, y) & \text{if } i \in I_y \text{ and } v_{xi} \in C(v_{yi}), \\ \gamma \, s(x, y) & \text{if } i \in I_y \text{ and } v_{xi} \notin C(v_{yi}), \\ s(x, y) & \text{if } i \notin I_y. \end{cases} \tag{2.15} $$
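The online update for the majority voting similarity can be sketched as a small event handler that is called for every arriving vote. The data layout, the function names and the vote classes {1, 2}, {3}, {4, 5} are assumptions for illustration, and the sketch assumes that each user votes at most once per item.

```python
VOTE_CLASS = {1: "low", 2: "low", 3: "mid", 4: "high", 5: "high"}

def on_new_vote(similarities, votes, user, item, vote, gamma=0.5):
    """Record a vote and update the similarity with every user who rated the same item.

    similarities: dict frozenset({x, y}) -> current similarity (implicitly 1 at the start)
    votes: dict user -> {item: vote}
    """
    votes.setdefault(user, {})[item] = vote
    for other, profile in votes.items():
        if other == user or item not in profile:
            continue  # the similarity is unchanged when the other user did not rate the item
        agree = VOTE_CLASS[vote] == VOTE_CLASS[profile[item]]
        pair = frozenset((user, other))
        similarities[pair] = similarities.get(pair, 1.0) * ((2 - gamma) if agree else gamma)

similarities, votes = {}, {}
on_new_vote(similarities, votes, "y", "song1", 5)  # nobody to compare with yet
on_new_vote(similarities, votes, "x", "song1", 4)  # same class: multiply by 1.5
on_new_vote(similarities, votes, "x", "song2", 1)
on_new_vote(similarities, votes, "y", "song2", 3)  # different class: multiply by 0.5
```

Each vote touches only the pairs that involve the voting user, so no full recalculation of the similarity table is needed.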



The constrained Pearson correlation can be turned into an online algorithm too. Suppose user $x$ and user $y$ have sets $I_x$ and $I_y$ of items they already rated. Now user $x$ makes a new vote for item $j$. This item is added to the set $I_x$, creating a new set $I'_x = I_x \cup \{j\}$. Then the new constrained Pearson correlation is given by

$$ s'(x, y) = \frac{\sum_{i \in I_x \cap I_y} (v_{xi} - c)(v_{yi} - c) + (v_{xj} - c)(v_{yj} - c)}{\sqrt{\left[\sum_{i \in I_x \cap I_y} (v_{xi} - c)^2 + (v_{xj} - c)^2\right] \left[\sum_{i \in I_x \cap I_y} (v_{yi} - c)^2 + (v_{yj} - c)^2\right]}} \tag{2.17} $$

provided that item $j$ is rated by user $y$. If user $y$ did not rate item $j$, the similarity between the users does not change. To calculate the constrained Pearson correlation incrementally, we maintain three sums for each pair of users, as shown in (2.17). The incremental calculation of the Pearson correlation is a bit more complex, but still possible.

The predictions can also be made incremental. The predictions change when the active user $a$ or another user $x$ makes new votes, therefore we distinguish the following four cases.

1. User $a$ makes a new vote for item $i$, while user $x$ did not rate item $i$. In this case the similarity
$s(a, x)$ does not change, therefore the predictions do not change. Of course, the prediction for
user $a$ and item $i$ is not necessary anymore.

2. User $x$ makes a new vote for item $i$, while user $a$ did not rate item $i$. In this case the similarity
$s(a, x)$ does not change, however the set $U_i$ is changed to $U'_i = U_i \cup \{x\}$. Hence, the prediction
for item $i$ is changed into

$$p'_{ai} = \frac{\sum_{y \in U_i} s(a, y)\, v_{yi} + s(a, x)\, v_{xi}}{\sum_{y \in U_i} |s(a, y)| + |s(a, x)|},$$

where we use the simple predictor, which is given by (2.13).

3. User $x$ makes a new vote for item $i$, while user $a$ already rated item $i$. Now the similarity $s(a, x)$
changes into $s'(a, x) = s(a, x) + \Delta$. The predictions with the simple predictor (2.13) for the
items $j \in I_x$ that user $a$ did not rate change as follows

$$p'_{aj} = \frac{\sum_{y \in U_j} s(a, y)\, v_{yj} + \Delta\, v_{xj}}{\sum_{y \in U_j} |s(a, y)| - |s(a, x)| + |s(a, x) + \Delta|}.$$

4. User $a$ makes a new vote for item $i$, while user $x$ already rated item $i$. The predictions can be
adapted in the way described in the preceding case. Of course, the prediction for item $i$ is not
necessary anymore.

The simple predictor can thus be made incremental by maintaining the denominator and the
numerator separately. The standard predictor and the majority voting predictor can be made incremental
in a similar way.
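The incremental bookkeeping for the constrained Pearson correlation can be sketched as follows. This is a minimal illustration, not the document's implementation: the class name, the choice c = 3 and the toy votes are assumptions, and the three running sums correspond to the numerator and the two factors under the square root.

```python
from math import sqrt


class OnlineConstrainedPearson:
    """Constrained Pearson correlation for one pair of users, updated per vote."""

    def __init__(self, c):
        self.c = c        # reference vote c (assumed here: midpoint of a 1..5 scale)
        self.s_xy = 0.0   # running sum of (v_xi - c)(v_yi - c)
        self.s_xx = 0.0   # running sum of (v_xi - c)^2
        self.s_yy = 0.0   # running sum of (v_yi - c)^2

    def add_common_item(self, v_x, v_y):
        """Call when an item becomes rated by both users."""
        dx, dy = v_x - self.c, v_y - self.c
        self.s_xy += dx * dy
        self.s_xx += dx * dx
        self.s_yy += dy * dy

    def similarity(self):
        denom = sqrt(self.s_xx * self.s_yy)
        return self.s_xy / denom if denom else 0.0


def batch_constrained_pearson(votes_x, votes_y, c):
    """Reference version that recomputes all sums from scratch."""
    common = sorted(votes_x.keys() & votes_y.keys())
    num = sum((votes_x[i] - c) * (votes_y[i] - c) for i in common)
    den = sqrt(sum((votes_x[i] - c) ** 2 for i in common)
               * sum((votes_y[i] - c) ** 2 for i in common))
    return num / den if den else 0.0


# Hypothetical votes: x and y share T1 and T2, then x votes for T3.
votes_x = {"T1": 5, "T2": 1}
votes_y = {"T1": 4, "T2": 2, "T3": 5}
online = OnlineConstrainedPearson(c=3)
for item in votes_x.keys() & votes_y.keys():
    online.add_common_item(votes_x[item], votes_y[item])

votes_x["T3"] = 4                          # the new vote for item j = T3
online.add_common_item(votes_x["T3"], votes_y["T3"])
```

Only the three sums are stored per pair of users, so a new vote costs constant time per pair instead of a full pass over the common items.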

2.3 Item-based algorithms 

Item-based algorithms [13, 31] act oppositely to user-based algorithms in the sense that instead of 
calculating similarities between users, similarities between items are calculated. The adjusted cosine 
similarity 

can be used to calculate matchings between items. The difference between the Pearson correlation
and the adjusted cosine is that the mean of the user is subtracted instead of the mean of the item. In 
pn] the correlation and the adjusted cosine were tested with an item-based algorithm. The latter was 
found the best. Some users tend to give higher votes than other users. If we want to compare the votes
of two items for different users, we have to scale them according to the user. Otherwise, users who tend
to give high votes have too much influence. The standard item-based predictor for user a and item i
is given by (2.19).

Now we need to calculate $O(I^2)$ similarities, while the sums are calculated over the users. Therefore
the similarity-calculation phase costs $O(I^2 U)$ time. The prediction phase has the same time complexity.
The total algorithm has a time complexity of $O(I^2 U)$.

The similarity measures and predictors for the user-based algorithm can all be turned into
similarity measures and predictors for the item-based algorithms. We expect the item-based algorithm
to have about the same performance as the user-based algorithm. The item-based algorithm should
theoretically be used if there are more users than items; then the algorithm is faster, but also the
predictions will be better, as the similarities are based on more data.
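As an illustration of the item-based approach, the sketch below computes an adjusted cosine between two items and a prediction from it. It makes two assumptions: the adjusted cosine has its usual form (user means subtracted, sums over users who rated both items), and the predictor is the item mean plus similarity-weighted offsets from the other item means. All function names and the toy data are hypothetical.

```python
from math import sqrt


def user_mean(ratings):
    return sum(ratings.values()) / len(ratings)


def item_mean(votes, item):
    vals = [r[item] for r in votes.values() if item in r]
    return sum(vals) / len(vals)


def adjusted_cosine(votes, i, j):
    """Similarity of items i and j over the users who rated both."""
    num = s_ii = s_jj = 0.0
    for ratings in votes.values():
        if i in ratings and j in ratings:
            mean = user_mean(ratings)      # the mean of the user is subtracted
            di, dj = ratings[i] - mean, ratings[j] - mean
            num += di * dj
            s_ii += di * di
            s_jj += dj * dj
    den = sqrt(s_ii * s_jj)
    return num / den if den else 0.0


def predict(votes, user, item):
    """Assumed predictor: item mean plus similarity-weighted offsets."""
    num = den = 0.0
    for j, vote in votes[user].items():
        s = adjusted_cosine(votes, item, j)
        num += s * (vote - item_mean(votes, j))
        den += abs(s)
    base = item_mean(votes, item)
    return base + num / den if den else base


# Hypothetical votes: three users rated items A and B, user "a" only rated A.
votes = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 4, "B": 2},
    "u3": {"A": 1, "B": 5},
    "a": {"A": 4},
}
sim_ab = adjusted_cosine(votes, "A", "B")
p_aB = predict(votes, "a", "B")
```

The outer loop in a full system runs over all pairs of items and the inner loop over the users, which is where the $O(I^2 U)$ cost of the similarity phase comes from.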

2.4 Factor analysis 

Factor analysis is a generalization of singular value decomposition fc, 7, 20 f a] and linear regression 
M- With factor analysis we try to make a linear model that can describe the users' preferences, given
by

$$V = AX + N,$$

where $V$ is an $I \times U$ matrix with the user profiles as columns. We assume that the user profiles are
generated by a random process, which depends on $X$. The entries of the $k \times U$ matrix $X$ are assumed
to be standard normally distributed. The noise ($N$) is normally distributed with mean 0 and variance $\psi$.
Hence, the user profiles are normally distributed with mean 0 too. In order to satisfy this assumption,
we subtract the mean of the users from their profiles. The $I \times U$ matrix $N$ represents the error that is
made when we approximate $V$ by $AX$. The $I \times k$ matrix $A$ consists of $k$ basis vectors. The columns
of the matrix $X$ give the combinations of these vectors needed to approximate each user in $V$. The
matrix $A$ is calculated in such a way that the noise $N$ is minimized. To build the model we use an EM
algorithm, where as initialisation we take a random model $A$ and $\psi = 1$. Then the algorithm proceeds
as follows:

$$\begin{aligned} M &= (\psi I + A^T A)^{-1}, \\ X &= M A^T V, \\ A &= V X^T (X X^T + U \psi M)^{-1}, \\ \psi &= \frac{1}{IU}\, \mathrm{tr}(V V^T - A X V^T), \end{aligned} \qquad (2.20)$$



where tr (trace) is the sum of the diagonal elements of a matrix. More about the EM algorithm and
its application to factor analysis can be found in [8, 22, 4]. The time complexity of this algorithm is
$O(UIk)$. The number of basis vectors $k$ is usually chosen small (less than twenty). In order to secure
the user profiles we split the calculations (2.20) in Section 2.4.1 between the users and the server.

Example 2.6: If we run a factor-analysis algorithm with $k = 2$ on the tea and coffee example we get
the model depicted in Figure 2.2. Before the iteration starts, we subtract the mean votes, such that the
mean of the user profiles is 0.

Figure 2.2: $V = AX + N$ for the example of Table 2.1.

The matrix $A$ has two columns corresponding to the two basis vectors. Aukje has a combination
for these two basis vectors of $x_{Aukje} = (-1.1, 1.5)$. The prediction for flavour T3 is calculated by
multiplying row T3 of the matrix $A$ and the vector $x$ of Aukje and adding her mean vote, $p_{Aukje,T3} =
-1 \cdot (-1.1) + 1 \cdot 1.6 + 2.8 = 6.8$. We can conclude that Aukje is incredibly fond of tea number three.
We round the prediction down to 5, as this is our maximum vote.

2.4.1 Recurrence relation 

The EM algorithm for factor analysis consists of two steps. In the first step (combination calculation) 
we calculate the combinations $X$ such that they describe the user profiles as well as possible given
the model $A$. In the second step (model calculation) we calculate a new model $A$ based on these
combinations $X$ such that the noise is minimized.

Combination calculation. Given model $A$ and variance $\psi$ we calculate a combination $X$ that describes
the matrix $V$ the best:

$$M = (\psi I + A^T A)^{-1}, \qquad X = M A^T V.$$

These calculations can be split among the users as follows:

$$\begin{aligned} A_y &= R_y A, \\ M_y &= (\psi I + A_y^T A_y)^{-1}, \\ x_y &= M_y A_y^T v_y. \end{aligned} \qquad (2.21)$$

The matrix $R_y$ is a diagonal matrix with elements $(R_y)_{ii} = r_{yi}$. The rows of the matrix $A$ correspond
to the items in the system. Row $i$ of matrix $A_y$ is 0 when item $i$ is not rated by user $y$. If item $i$ is rated
by user $y$, row $i$ of matrix $A_y$ is equal to row $i$ of $A$. Note that the items the user did not rate are not

Model calculation. Given the database $V$ and the combinations $X$, we calculate a new model $A$ with
minimum variance $\psi$.

$$A = V X^T (X X^T + U \psi M)^{-1}$$

The first equation can also be written as $A(X X^T + U \psi M) = V X^T$. Splitting the calculations over the
users gives us

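Now we make a vector of the matrix $A$ by putting the rows of the matrix behind each other, and denote
this vector by $L(A)$. Similarly we have $L(B_y) = L(v_y x_y^T)$. Furthermore, we want to use the Kronecker
product $\otimes$.

Definition 2.1 Let $A$ be an $n_A \times m_A$ matrix with elements $a_{ij}$ and let $B$ be an $n_B \times m_B$ matrix, then

$$A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1 m_A} B \\ \vdots & \ddots & \vdots \\ a_{n_A 1} B & \cdots & a_{n_A m_A} B \end{pmatrix}.$$

This matrix consists of $n_A \times m_A$ blocks of size $n_B \times m_B$.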
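The block structure of the Kronecker product and the row-stacking operation can be checked on a small case. The sketch below uses plain lists of rows; the function names `kron` and `stack_rows` are illustrative, not taken from the text.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def stack_rows(A):
    """L(A): put the rows of A behind each other."""
    return [x for row in A for x in row]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
K = kron(A, B)   # 2 x 2 blocks, each block is a_ij * B
```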
Then we have the equality

Now we only have to split the calculation for $\psi$ between the users. As the trace is the sum of the
diagonal elements of a matrix, we have that $\mathrm{tr}(V V^T - A X V^T) = \mathrm{tr}(V V^T) - \mathrm{tr}(A X V^T)$. The first
term can be rewritten as

The second term is obtained in the same way.

Now we have

The total number of votes is equal to $\sum_y |I_y|$. In the algorithm, the users can calculate the parts

The server calculates the sums over the users and makes a new model

In the calculation of $\psi$ we use that

$$\sum_{y=1}^{U} \mathrm{tr}(A x_y v_y^T) = \sum_{y=1}^{U} \mathrm{tr}(A B_y^T) = L(A)^T \sum_{y=1}^{U} L(B_y).$$

The value $J$ is an approximation for $\sum_y |I_y|$, the number of votes. In the first iteration we can for
instance make an approximation for $\sum_y |I_y|$ by computing $J$ such that $\psi = 1$.

When a user wants recommendations, he first downloads $A$ and $\psi$. With this model he calculates
his combinations via (2.21). By multiplying he can calculate the predictions $Ax$.

Chapter 3

Encryption

This chapter describes the building blocks that we use to achieve the desired security. Most of the
methods for encryption rely on a problem that is easy to solve when some knowledge (a secret key) is
available and difficult to solve when that specific knowledge is not available.



3.1 Basic operations 

The user-based algorithm requires the calculation of inner products between profiles of different
users. The item-based algorithm requires the calculation of sums over the users, where the elements
have to remain secret.

Example 3.1: Aukje rated tea T1 and T2, Jan rated tea T1 and T3. If they want to know how many teas
they both rated without giving information about which teas they exactly rated, they need a protocol to
calculate the inner product between their rating vectors. The rating vector $r_A$ of Aukje is (1, 1, 0) and
the vector $r_J$ of Jan is (1, 0, 1). The inner product between these vectors is 1, which is precisely the
number of teas they both rated. Of course, Aukje has to protect her data, which she does by applying
an encryption function $\varepsilon(r)$, where $r$ is an element of her rating vector $r_A$. In this way, she obtains the
encrypted vector $(\varepsilon(r_{A,T1}), \varepsilon(r_{A,T2}), \varepsilon(r_{A,T3}))$. Because nobody is allowed to read her data, we need
an operation MULTIPLY with the property

$$\mathrm{MULTIPLY}(m_1, \varepsilon(m_2)) = \varepsilon(m_1 \cdot m_2) \qquad (3.1)$$

and an operation SUM with the property

$$\mathrm{SUM}(\varepsilon(m_1), \varepsilon(m_2)) = \varepsilon(m_1 + m_2).$$

Jan multiplies his vector elementwise with the encrypted vector of Aukje, by applying the
MULTIPLY protocol. This results in the vector $(\varepsilon(r_{J,T1} r_{A,T1}), \ldots, \varepsilon(r_{J,T3} r_{A,T3}))$. Finally, he
applies the SUM protocol to obtain the encrypted inner product, while the data of Aukje is not revealed
to him. The inner product is decrypted by Aukje, as she has the secret key. A cryptosystem which
possesses the desired properties for the SUM and MULTIPLY operations is described in Section 3.2.

Example 3.2: Aukje, Jan, Wim and Arnout want to know how many times an opinion is given about
tea T2, without revealing whether they rated the tea or not. T2 is rated by Wim, Arnout and Aukje and
not rated by Jan, i.e. $r_{W,T2} = r_{Ar,T2} = r_{Au,T2} = 1$, and $r_{J,T2} = 0$. So there are three persons
who gave an opinion about tea T2. To protect these data, they apply an encryption function $\varepsilon(r)$. The
encrypted sum $\varepsilon(r_{W,T2} + r_{Ar,T2} + r_{Au,T2} + r_{J,T2})$ is obtained via a SUM operation as described in
the previous example. The problem is that none of them can have the secret key. The solution is to
share the key among the users. They can only decrypt a message if they cooperate. A cryptosystem
with key sharing is described in Section 3.3. Another solution is to use two independent servers. The
first server calculates the encrypted sum, the second server has the key and can decrypt the message.

A system that has the properties described in the above examples is the so-called Paillier system,
which is described in the next section. We conclude this section with some notations that we will use
in the remaining sections.

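• Let $Z_n^* = \{x \in Z \mid 0 < x < n, \gcd(x, n) = 1\}$ denote the multiplicative subgroup of integers
modulo $n$. The size of this group is denoted by $\phi(n)$. If the prime decomposition of $n$ is
$n = p_1^{e_1} p_2^{e_2} \cdots p_k^{e_k}$, then $\phi(n) = p_1^{e_1 - 1}(p_1 - 1)\, p_2^{e_2 - 1}(p_2 - 1) \cdots p_k^{e_k - 1}(p_k - 1)$.

• The Carmichael's function $\lambda(n)$ is defined as the smallest integer $m$ such that for all $w \in Z_n^*$,
$w^m \bmod n = 1$. We can calculate the values for $\lambda(n)$ with the following recurrence relation.

$$\lambda(n) = \begin{cases} \phi(n) & \text{if } n = p^{\alpha} \text{ and } (p = 2 \wedge \alpha \le 2) \vee (p \ge 3), \\ \tfrac{1}{2}\phi(n) & \text{if } n = 2^{\alpha} \text{ and } \alpha \ge 3, \\ \mathrm{lcm}(\lambda(2^{\alpha}), \lambda(p_1^{e_1}), \ldots, \lambda(p_k^{e_k})) & \text{if } n = 2^{\alpha} \prod_{i=1}^{k} p_i^{e_i}. \end{cases}$$

In accordance with the definition of $\lambda(n)$, Carmichael's theorem states that if $w \in Z_n^*$ then
$w^{\lambda(n)} \equiv 1 \pmod{n}$.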

• An integer $z$ is said to be an $r$-th residue modulo $n$ if and only if there exists some integer $w \in Z_n^*$
such that $z = w^r \bmod n$.

• For each positive integer $n$, the size (or order) of $n$ is defined by $|n| = \lceil \log_2 n \rceil$.
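The functions $\phi(n)$ and $\lambda(n)$ from the list above can be computed directly from the prime decomposition. The sketch below follows the recurrence for $\lambda(n)$ and compares it with a brute-force search for the smallest exponent; the helper names are illustrative and the trial-division factorization is only meant for small $n$.

```python
from math import gcd


def factorize(n):
    """Prime decomposition of n as {prime: exponent}, by trial division."""
    factors, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def phi(n):
    result = 1
    for p, e in factorize(n).items():
        result *= p ** (e - 1) * (p - 1)
    return result


def carmichael(n):
    lam = 1
    for p, e in factorize(n).items():
        part = p ** (e - 1) * (p - 1)      # phi(p^e)
        if p == 2 and e >= 3:
            part //= 2                      # lambda(2^e) = phi(2^e) / 2 for e >= 3
        lam = lam * part // gcd(lam, part)  # least common multiple
    return lam


def carmichael_bruteforce(n):
    """Smallest m with w^m = 1 (mod n) for every unit w, found by search."""
    units = [w for w in range(1, n) if gcd(w, n) == 1]
    m = 1
    while any(pow(w, m, n) != 1 for w in units):
        m += 1
    return m
```

For the modulus $n = 35$ used in the later examples this gives $\phi(35) = 24$ and $\lambda(35) = 12$.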

3.2 The Paillier cryptosystem

This section introduces the Paillier cryptosystem [17]. This system possesses the desired properties
of the operations MULTIPLY and SUM as defined in Example 3.1. The implementation of these
operations is described in Section 3.2.2, which describes the algorithms needed for encryption and
decryption too. The subsequent section describes some theory behind these algorithms.

3.2.1 Basics

First we explain the problem on which the cryptosystem of Paillier relies, which is called the composite
degree residuosity class problem. Next, we show how the problem can be solved when we have some
specific knowledge. Before the problem is explained, we define set $B$ as

$$B = \{ g \in Z_{n^2}^* \mid |g| = |\alpha n| \text{ for an } \alpha = 1, \ldots, \lambda(n) \}.$$

Definition 3.1 (Composite degree residuosity class problem) Given $w \in Z_{n^2}^*$ and $g \in B$, try to find the
residuosity class $x \in Z_n$ for which there exists a $y \in Z_n^*$ such that

$$w = g^x y^n \bmod n^2.$$

The class of $w$ with respect to $g$ is denoted by $[w]_g$.
If $g \in B$ we have that the function $\varepsilon_g : Z_n \times Z_n^* \to Z_{n^2}^*$ defined by

$$\varepsilon_g(x, y) = g^x y^n \bmod n^2$$

is bijective, i.e., the composite residuosity class problem has a unique solution given a $w \in Z_{n^2}^*$. If we
choose $n = pq$, where $p$ and $q$ are two primes, then the Carmichael's function $\lambda(n) = \mathrm{lcm}(p - 1, q - 1)$.
In the following, we will refer to this $\lambda(n)$ as $\lambda$. Due to Carmichael's theorem we have that for any
$w \in Z_{n^2}^*$

$$w^{\lambda} \equiv 1 \pmod{n} \qquad (3.4)$$
$$w^{n\lambda} \equiv 1 \pmod{n^2} \qquad (3.5)$$

Theorem 3.1 For any $w \in Z_{n^2}^*$, $w^{\lambda} \equiv 1 + [w]_{1+n} \lambda n \pmod{n^2}$.

Proof As $|1 + n| = |n|$, we have that $(1 + n) \in B$, so there exists a unique pair $(a, b) \in Z_n \times Z_n^*$ with
the property that $w = (1 + n)^a b^n \bmod n^2$. By definition, $a = [w]_{1+n}$. By making use of (3.5) we have

$$w^{\lambda} = (1 + n)^{a\lambda} b^{n\lambda} = (1 + n)^{a\lambda}.$$

Because $(1 + n)^x \equiv 1 + xn \pmod{n^2}$ we have that

$$(1 + n)^{a\lambda} \equiv 1 + a \lambda n \pmod{n^2},$$

which yields the result.

Definition 3.2 For any $u \in \{u < n^2 \mid u \equiv 1 \pmod{n}\}$, we define

$$L(u) = \frac{u - 1}{n}.$$

By making use of Theorem 3.1 the residuosity class $[w]_g$ is

$$\frac{L(w^{\lambda} \bmod n^2)}{L(g^{\lambda} \bmod n^2)} \bmod n = \frac{[w]_{1+n}}{[g]_{1+n}} \bmod n = [w]_g,$$

where the division is calculated by multiplying with the inverse of $L(g^{\lambda} \bmod n^2)$ modulo $n$. The latter
equality can be shown by using the fact that by definition $g = (n + 1)^{[g]_{1+n}} b_1^n \bmod n^2$ for a certain
$b_1 \in Z_n^*$, and hence,

$$g^{[w]_g} b_2^n \bmod n^2 = (n + 1)^{[g]_{1+n} [w]_g} b_3^n \bmod n^2 = w,$$

where $b_2 \in Z_n^*$ and $b_3 = b_1^{[w]_g} b_2$.

3.2.2 Encryption and decryption

In short the encryption and decryption of a message goes as follows.

Key generation. Choose two large primes $p$, $q$ and $g \in B$. Set $n = pq$ and $\lambda = \mathrm{lcm}(p - 1, q - 1)$.
The public key is the pair $(n, g)$, the private key is $\lambda$.

Encryption. The user who wants to send a message $m \in Z_n$ to a receiver with public keys $n$ and $g$
makes a ciphertext $c = \varepsilon(m) = g^m r^n \bmod n^2$, where $r$ is randomly chosen from $Z_n^*$. In this way the
message $m$ cannot be obtained by trying the possible values of $m$.

Decryption. The receiver can obtain the message $m$ via:

$$m = \frac{L(c^{\lambda} \bmod n^2)}{L(g^{\lambda} \bmod n^2)} \bmod n,$$

where the division is taken by multiplying with the inverse of $L(g^{\lambda} \bmod n^2)$ modulo $n$. In Example
3.1 we used a SUM and MULTIPLY function. In the Paillier system, the SUM function is
implemented as a multiplication of the ciphertexts, and the MULTIPLY function is implemented by
simply taking powers, as

$$\varepsilon(m_1)\varepsilon(m_2) = g^{m_1} r_1^n g^{m_2} r_2^n \bmod n^2 = g^{m_1 + m_2} (r_1 r_2)^n \bmod n^2 = \varepsilon(m_1 + m_2), \qquad (3.6)$$

$$\varepsilon(m_1)^{m_2} = (g^{m_1} r^n)^{m_2} \bmod n^2 = g^{m_1 m_2} (r^{m_2})^n \bmod n^2 = \varepsilon(m_1 m_2). \qquad (3.7)$$

A ciphertext can always be changed into another ciphertext without affecting the plaintext with the
property given by

$$\varepsilon(m_1)\, r_2^n = g^{m_1} r_1^n r_2^n \bmod n^2 = g^{m_1} (r_1 r_2)^n \bmod n^2 = \varepsilon(m_1), \qquad (3.8)$$

where the ciphertext of the message $m_1$ is changed by multiplying with $r_2^n$, where $r_2$ is randomly chosen
from $Z_n^*$.

Example 3.3: Choose $p = 5$ and $q = 7$, then $n = 35$, $n^2 = 1225$, and $\lambda = 12$. A valid choice for $g$ is
$g = 36$, as $|36| = 6$ and $|1 \cdot 35| = 6$. Now Aukje can encrypt her vector (1, 1, 0) as follows

$36^1 \cdot 3^{35} \bmod 1225 = 1027$
$36^1 \cdot 4^{35} \bmod 1225 = 639$
$36^0 \cdot 29^{35} \bmod 1225 = 99$

She sends her encrypted vector to Jan. Jan applies the MULTIPLY and SUM operations,
$1027^1 \cdot 639^0 \cdot 99^1 \bmod 1225 = 1223$, and sends the encrypted inner product back to Aukje. Aukje knows
the secret key, so she can calculate the inner product by applying the decryption function

$$\frac{((1223^{12} \bmod 1225) - 1)/35}{((36^{12} \bmod 1225) - 1)/35} \bmod 35 = 12 \cdot 3 \bmod 35 = 1,$$

where 3 is the inverse of 12 modulo 35.
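The encryption, decryption, SUM and MULTIPLY operations can be sketched with the toy parameters of Example 3.3. The constant and function names are illustrative, the random values $r$ are passed in explicitly so that the numbers can be reproduced, and the key size is of course far too small for real use. The modular inverse via `pow(x, -1, n)` assumes Python 3.8 or later.

```python
from math import gcd

P, Q, G = 5, 7, 36                 # toy parameters of Example 3.3
N = P * Q                          # n = 35
N2 = N * N                         # n^2 = 1225
LAM = (P - 1) * (Q - 1) // gcd(P - 1, Q - 1)   # lambda = lcm(p - 1, q - 1) = 12


def L(u):
    return (u - 1) // N


def encrypt(m, r):
    """c = g^m * r^n mod n^2, with r a unit modulo n."""
    assert gcd(r, N) == 1
    return pow(G, m, N2) * pow(r, N, N2) % N2


def decrypt(c):
    inv = pow(L(pow(G, LAM, N2)), -1, N)   # inverse of L(g^lambda mod n^2) modulo n
    return L(pow(c, LAM, N2)) * inv % N


def multiply(m1, c2):
    """MULTIPLY: plaintext m1 times encrypted m2, by taking a power."""
    return pow(c2, m1, N2)


def add(c1, c2):
    """SUM: the product of two ciphertexts encrypts the sum of the plaintexts."""
    return c1 * c2 % N2


aukje = [encrypt(1, 3), encrypt(1, 4), encrypt(0, 29)]   # her encrypted vector (1, 1, 0)
jan = [1, 0, 1]                                          # Jan's plaintext vector
enc_inner = 1                                            # encryption of 0 with r = 1
for r_j, c_a in zip(jan, aukje):
    enc_inner = add(enc_inner, multiply(r_j, c_a))
inner = decrypt(enc_inner)                               # done by Aukje, who holds lambda
```

Jan only ever handles ciphertexts, and Aukje only sees the decrypted inner product, which matches the division of roles in the example.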

3.3 The threshold version of the Paillier cryptosystem

A threshold cryptosystem fel allows any subset of $t + 1$ out of $\ell$ users to decrypt the ciphertext but
prevents decryption if at most $t$ users participate.

Key generation.

• Choose two strong primes $p$, $q$, i.e. $p = 2p' + 1$, $q = 2q' + 1$, where $p'$ and $q'$ are prime too. Set
$n = pq$ and $n' = p'q'$.

• Choose random $\beta, a, b \in Z_n^*$ and set $g = (1 + n)^a b^n \bmod n^2$. The secret key is $s = \beta n'$. Set
$\theta = a n' \beta \bmod n$.

• The public key is the triple $(n, g, \theta)$ and the private key is $s$, which is shared among the users
with key sharing.

Key sharing. The Shamir key sharing scheme [24] is based on polynomial interpolation. We can
retrieve a polynomial $f(X) = \sum_{i=0}^{t} a_i X^i$ of degree $t$ with unknown coefficients $a_i$, if the values $f(x_u)$
of $t + 1$ points $x_u$ are known, i.e.

$$f(X) = \sum_{u=0}^{t} \prod_{u' \ne u} \frac{X - x_{u'}}{x_u - x_{u'}}\, f(x_u).$$

To share the key $s$ among the users, we choose a polynomial of degree $t$ such that $f(0) = a_0 = s$. The
other coefficients $a_i$, where $0 < i \le t$, are randomly chosen from $Z_{nn'}$. Now we give each user $u$ a
point $(u, s_u)$, where $s_u = f(u) \bmod nn'$. We call $s_u$ the share of user $u$. Then the secret key $s$ can be
reconstructed via:

$$s = f(0) = \sum_{u \in \Omega} s_u L_{u,0} \bmod nn',$$

where $\Omega$ is a set of $t + 1$ users with a share $s_u$, and $L_{u,0} = \prod_{u' \in \Omega \setminus \{u\}} \frac{u'}{u' - u}$.

Encryption. To encrypt a message $m \in Z_n$, compute $c = \varepsilon(m) = g^m r^n \bmod n^2$ with a random
$r \in Z_n^*$, just as in the normal Paillier system.

Share decryption. Each user $u$ computes a decryption share $c_u = \varepsilon'_u(c) = c^{2 \Delta s_u} \bmod n^2$, where
$\Delta = \ell!$. Note that the share $s_u$ of user $u$ is hidden in the same way as the message $m$.

End decryption. Let $\Omega$ be a set of $t + 1$ valid shares. Then the message $m$ is retrieved by:

$$m = \delta'(C) = L\left(\prod_{u \in \Omega} c_u^{2 \mu_u} \bmod n^2\right) \cdot (4 \Delta^2 \theta)^{-1} \bmod n,$$

where $\mu_u = \Delta L_{u,0}$ and $C = \{c_u \mid u \in \Omega\}$.
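The share decryption and end decryption steps can be traced with the small parameters that the text uses in Example 3.4 ($n = 35$, $g = 36$, $\theta = 24$, four users, $f(x) = 24 + 3x$). The sketch below is an illustration under those assumptions; the names are hypothetical, the random values $r$ are fixed for reproducibility, and negative exponents in `pow` require Python 3.8 or later.

```python
from fractions import Fraction
from math import factorial, gcd

N, G, THETA = 35, 36, 24           # public key (n, g, theta) of Example 3.4
N2 = N * N
USERS = 4
DELTA = factorial(USERS)           # Delta = 4! = 24
SHARES = {u: 24 + 3 * u for u in range(1, USERS + 1)}   # s_u = f(u), f(x) = 24 + 3x


def encrypt(m, r):
    assert gcd(r, N) == 1
    return pow(G, m, N2) * pow(r, N, N2) % N2


def share_decrypt(c, s_u):
    """Decryption share c_u = c^(2 * Delta * s_u) mod n^2."""
    return pow(c, 2 * DELTA * s_u, N2)


def mu(u, omega):
    """mu_u = Delta * prod_{u' in omega, u' != u} u' / (u' - u), an integer."""
    value = Fraction(DELTA)
    for other in omega:
        if other != u:
            value *= Fraction(other, other - u)
    assert value.denominator == 1
    return int(value)


def end_decrypt(dec_shares):
    """Combine t + 1 decryption shares {u: c_u} into the plaintext."""
    omega = list(dec_shares)
    prod = 1
    for u, c_u in dec_shares.items():
        prod = prod * pow(c_u, 2 * mu(u, omega), N2) % N2
    return (prod - 1) // N * pow(4 * DELTA ** 2 * THETA, -1, N) % N


# Wim, Jan, Arnout and Aukje encrypt whether they rated T2; SUM is a product.
c = 1
for m, r in [(1, 2), (0, 3), (1, 4), (1, 6)]:
    c = c * encrypt(m, r) % N2

c_wim = share_decrypt(c, SHARES[1])
c_jan = share_decrypt(c, SHARES[2])
rated = end_decrypt({1: c_wim, 2: c_jan})
```

Any other pair of users, for instance users 3 and 4, recovers the same value, while a single share on its own does not reveal the plaintext.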

Example 3.4: Choose $p' = 2$ and $q' = 3$, then $p = 5$, $q = 7$, $n = 35$, $n^2 = 1225$, and $n' = 6$. Choose
$g = 36$ (i.e. $a = 1$ and $b = 1$), and $\beta = 4$, resulting in $s = 6 \cdot 4 = 24$ and $\theta = 24$. There are four users,
so $\Delta = 4! = 24$. The public keys are $\theta = 24$, $n = 35$, $g = 36$. We share the secret key $s = 24$ in such
a way that two users must collaborate to decrypt the message (i.e. $t = 1$). We hence make a function
$f(x) = 24 + 3x$ with degree 1, where the first coefficient is $s$ and the second coefficient (3) is randomly
chosen. Wim gets the share $24 + 3 \cdot 1 = 27$ and Jan, Arnout and Aukje get the shares $f(2) = 30$, $f(3) = 33$,
and $f(4) = 36$, respectively. With the use of the public keys and the SUM operation they can encrypt
their messages.

Now we need two users who calculate the decrypted share of the resulting ciphertext $c$:

Wim: $c^{2 \cdot 24 \cdot 27} \bmod 1225 = 106$,

Jan: $c^{2 \cdot 24 \cdot 30} \bmod 1225 = 526$.

The end decryption calculates the message

$$\left( L(106^{2 \cdot 48} \cdot 526^{2 \cdot (-24)} \bmod 1225) \cdot (4 \cdot 24^2 \cdot 24)^{-1} \right) \bmod 35 = (23 \cdot 26) \bmod 35 = 3,$$

where $\Omega = \{1, 2\}$, $\mu_1 = \Delta \frac{2}{2 - 1} = 48$ and $\mu_2 = \Delta \frac{1}{1 - 2} = -24$. The number of users that rated T2 is 3.

3.4 The El Gamal cryptosystem

Every cryptosystem that has an implementation for the SUM and MULTIPLY operations can in
principle be used instead of the Paillier system. Sometimes a little trick can be applied to obtain the
desired properties, such as in the El Gamal cryptosystem [10]. This system works as follows.

Key generation. Choose two primes $p$, $q$ such that $q \mid p - 1$, and a value $g$. Choose a secret key
$s \in Z_q$. The public keys are $p$, $g$ and $g^s \bmod p$.

Encryption. Choose a random $r$. Then the encryption function is

$$(c_1, c_2) = \varepsilon(m) = (g^r \bmod p,\; m\, g^{sr} \bmod p),$$

where $g^{sr} \bmod p$ is calculated by raising $g^s \bmod p$ to the power $r$.

Decryption. The decryption consists of two steps.

$$\delta_1(c_1) = c_1^s \bmod p = g^{sr} \bmod p,$$

$$\delta_2(c_2) = c_2\, \delta_1^{-1}(c_1) \bmod p = m\, g^{sr} g^{-sr} \bmod p = m,$$

where $g^{-sr}$ is the inverse of $g^{sr}$ in $Z_p$. The system has the property that

$$\varepsilon(m_1)\, \varepsilon(m_2) = (g^{r_1} g^{r_2} \bmod p,\; m_1 g^{s r_1} m_2 g^{s r_2} \bmod p) = (g^{r_1 + r_2} \bmod p,\; m_1 m_2\, g^{s(r_1 + r_2)} \bmod p) = \varepsilon(m_1 m_2). \qquad (3.9)$$

We can turn the El Gamal cryptosystem into a system with SUM and MULTIPLY operations by
encrypting $\gamma^m$ instead of $m$, where $\gamma$ is known to all users in the system. The message $m$
can be retrieved by trying all the possible messages.

Example 3.5: Suppose we choose $p = 11$, $q = 5$, $g = 3$, $s = 4$, and $\gamma = 2$. Then Aukje encrypts her
vector as follows:

$\varepsilon(\gamma^1) = (3^1 \bmod 11,\; 2^1 \cdot 3^{4 \cdot 1} \bmod 11) = (3, 8)$
$\varepsilon(\gamma^1) = (3^2 \bmod 11,\; 2^1 \cdot 3^{4 \cdot 2} \bmod 11) = (9, 10)$
$\varepsilon(\gamma^0) = (3^3 \bmod 11,\; 2^0 \cdot 3^{4 \cdot 3} \bmod 11) = (5, 9)$

Jan applies the MULTIPLY and SUM operations of the El Gamal system, $(3^1 \cdot 9^0 \cdot 5^1 \bmod 11,\; 8^1 \cdot
10^0 \cdot 9^1 \bmod 11) = (4, 6)$. He sends his result back to Aukje. Aukje calculates $6 \cdot (4^4)^{-1} \bmod 11 =
6 \cdot 4 \bmod 11 = 2$. The inner product is 1, as $\gamma^1 = 2$.
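The trick of encrypting $\gamma^m$ can be replayed with the numbers of Example 3.5. The sketch below is illustrative only: the names are hypothetical, the random values $r = 1, 2, 3$ are the ones that reproduce the ciphertexts of the example, and the final search over candidate messages is the time-consuming step the text mentions. The modular inverse via `pow(x, -1, p)` assumes Python 3.8 or later.

```python
P, G, S, GAMMA = 11, 3, 4, 2       # toy values of Example 3.5
H = pow(G, S, P)                   # public value g^s mod p


def encrypt(m, r):
    """Encrypt gamma^m instead of m: (g^r, gamma^m * g^(s r)) mod p."""
    return (pow(G, r, P), pow(GAMMA, m, P) * pow(H, r, P) % P)


def multiply(m1, c):
    """MULTIPLY: raise both components to the plaintext power m1."""
    return (pow(c[0], m1, P), pow(c[1], m1, P))


def add(c, d):
    """SUM: multiply the ciphertexts componentwise."""
    return (c[0] * d[0] % P, c[1] * d[1] % P)


def decrypt(c, max_m):
    """Recover gamma^m with the secret key, then search for m in 0..max_m."""
    target = c[1] * pow(pow(c[0], S, P), -1, P) % P
    for m in range(max_m + 1):
        if pow(GAMMA, m, P) == target:
            return m
    raise ValueError("message outside the searched range")


aukje = [encrypt(1, 1), encrypt(1, 2), encrypt(0, 3)]   # her vector (1, 1, 0)
jan = [1, 0, 1]
enc_inner = (1, 1)                                      # encryption of gamma^0 with r = 0
for r_j, c_a in zip(jan, aukje):
    enc_inner = add(enc_inner, multiply(r_j, c_a))
inner = decrypt(enc_inner, max_m=3)
```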

In the same way other systems with property (3.9) can be turned into systems with the same
properties as the Paillier system. A disadvantage of these systems is the search for the correct message
in the last step, which can be time consuming. Therefore the system is only used when $m$ takes a
limited number of values. The El Gamal system has a threshold version too, which is obtained in
the same way as the Paillier threshold system. A more detailed description of the El Gamal threshold
system is given in [5, 6].

Chapter 4

Protocols for the User-Based Algorithm

In this chapter we shall derive protocols needed for the protection of valuable data in user-based
algorithms.

4.1 Protocols for the similarities

In this section we derive protocols for the similarity measures mentioned in Section i.zs. 
4.1.1 Protocols for the distances

Mean-square difference. The mean-square difference can be rewritten as

$$\frac{\sum_{i \in I_a \cap I_x} (v_{ai} - v_{xi})^2}{|I_a \cap I_x|} = \frac{r_x^T v_a^2 - 2\, v_x^T v_a + (v_x^2)^T r_a}{r_x^T r_a},$$

where the vector $v_a^2$ is the vector with elements $v_{ai}^2$ and items that are not rated receive a zero vote,
i.e. $v_{ai} = 0$. The mean-square difference consists of four inner products between the users. These
four inner products can be calculated in the way described in Section 3.2.2. The active user calculates
his vectors $r_a$, $v_a$ and $v_a^2$ first, as shown in Figure 4.1. He encrypts all entries in the vectors with the
encryption function of the Paillier system. These vectors are sent to the server. The server sends the
vectors to the other users in the system. The other users have already calculated their vectors $r_x$, $v_x$
and $v_x^2$. They can calculate the four encrypted inner products, $\varepsilon(r_x^T r_a)$, $\varepsilon(r_x^T v_a^2)$, $\varepsilon(v_x^T v_a)$, and $\varepsilon((v_x^2)^T r_a)$, by
taking powers and multiplying all elements in a vector. These inner products are sent to the server and
the server sends them to the active user. The active user can decrypt the four inner products and calculate
the mean-square difference. The sum $r_x^T v_a^2$ can reveal information about items another user rated. As
the server takes care of the conversation between two users, the other user stays anonymous and the
active user only knows the value of the correlation with another (anonymous) user.
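That the mean-square difference really decomposes into four inner products can be checked on plaintext vectors before any encryption is involved. The sketch below assumes the usual definition of the mean-square difference over the commonly rated items and uses 0 for "not rated", as in the text; the function names and votes are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def msd_direct(v_a, v_x):
    """Mean-square difference over the items both users rated (0 = not rated)."""
    common = [(a, x) for a, x in zip(v_a, v_x) if a and x]
    return sum((a - x) ** 2 for a, x in common) / len(common)


def msd_inner_products(v_a, v_x):
    """The same value, written with the four inner products of the protocol."""
    r_a = [1 if v else 0 for v in v_a]
    r_x = [1 if v else 0 for v in v_x]
    v_a2 = [v * v for v in v_a]
    v_x2 = [v * v for v in v_x]
    return (dot(r_x, v_a2) - 2 * dot(v_x, v_a) + dot(v_x2, r_a)) / dot(r_x, r_a)


v_a = [5, 0, 3, 4]   # hypothetical votes of the active user
v_x = [4, 2, 0, 1]   # hypothetical votes of another user
```

In the protocol each of the four `dot` calls is replaced by an encrypted inner product, so the active user sees only the four decrypted sums.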

Active user    Server    Other users

Figure 4.1: Protocol for the mean-square difference. The numbers above the arrows denote the
number of messages that are sent by the active user a, by the server or by another user x.

Manhattan distance. We were not able to calculate absolute values in a secured way. We can calculate
the encrypted difference between two votes, but then we have to decrypt the difference in
order to see if it is positive or negative.

4.1.2 Protocol for the correlations 

Pearson correlation. We start with rewriting the Pearson correlation. Let us define a vector $w_a$ with

$$(w_a)_i = \begin{cases} v_{ai} - \bar{v}_a & \text{if } i \in I_a, \\ 0 & \text{otherwise.} \end{cases} \qquad (4.1)$$

We write $w_a^2$ for the vector with elements $(w_a)_i^2$. Then the vector notation of the Pearson correlation is

$$\frac{\sum_{i \in I_a \cap I_x} (v_{ai} - \bar{v}_a)(v_{xi} - \bar{v}_x)}{\sqrt{\sum_{i \in I_a \cap I_x} (v_{ai} - \bar{v}_a)^2 \sum_{i \in I_a \cap I_x} (v_{xi} - \bar{v}_x)^2}} = \frac{w_x^T w_a}{\sqrt{r_x^T w_a^2 \cdot (w_x^2)^T r_a}}.$$

The correlation consists of three inner products between the users. We can use the Paillier system as
explained in Section 3.2.2 to calculate these inner products without giving information about the users'
tastes.

Active user    Server    Other users

Figure 4.2: Protocol for the Pearson correlation.

The active user sends the encrypted vectors $\varepsilon(w_a)$, $\varepsilon(w_a^2)$ and $\varepsilon(r_a)$ to the other users, as shown
in Figure 4.2. The other users calculate the three encrypted inner products, $\varepsilon(w_x^T w_a)$, $\varepsilon(r_x^T w_a^2)$ and
$\varepsilon((w_x^2)^T r_a)$. These inner products are sent via the server to the active user. The active user can decrypt
the three inner products and calculate the Pearson correlation. The server should take care that the
other users stay anonymous. The inner product $r_x^T w_a^2$ reveals some information
about items another user rated.

Constrained Pearson correlation. The constrained Pearson correlation can be secured in an identical
way as the Pearson correlation.

4.1.3 Protocols for the counting measures

Weighted kappa statistic. The weighted kappa statistic can be written in terms of inner products too.
Let us define the vectors $r_{a,v}$ with elements

$$(r_{a,v})_i = \begin{cases} 1 & \text{if } v_{ai} = v, \\ 0 & \text{otherwise.} \end{cases} \qquad (4.3)$$

We can write an element of the matrix $p_{ax}$ in vector notation as

$$(p_{ax})_{vw} = \frac{1}{|I_a \cap I_x|}\, r_{a,v}^T r_{x,w} = \frac{r_{a,v}^T r_{x,w}}{r_a^T r_x}.$$

An element of the matrix $q_{ax}$ is written in vector notation as

Now construct vectors $p$ and $q$ of the matrices by putting the columns behind each other. Then the
observed value is $p^T w$ and the expected value is $q^T w$, where $w$ is a vector of the weights
matrix. If we have the observed value and the expected value, then the kappa statistic can be calculated
in the normal way.

Active user    Server    Other users

Figure 4.3: Protocol for the kappa statistic.

The protocol (Figure 4.3) makes use of the Paillier system as described in Section 3.2.2. The active
user calculates the vectors $r_a$ and $r_{a,v}$ for $v = 1, \ldots, 5$. These vectors are encrypted and sent to the
other users. The other users apply an inner product protocol to the vectors. The encrypted inner
products are sent to the server; the server calculates the encrypted sum with the
inner product protocol and sends this sum and the remaining inner products $\varepsilon(r_a^T r_x)$, $\varepsilon(r_{a,v}^T r_x)$ and
$\varepsilon(r_{x,v}^T r_a)$ to the active user. The active user can calculate the inner product $p^T w$ and the vector $q$. As
we assume that the weight matrix $w$ is public the user can calculate the inner product $q^T w$ and hence
obtain the kappa statistic.

Majority voting. Recall that the majority voting similarity is

$$s(a, x) = (2 - \gamma)^{c_{ax}}\, \gamma^{d_{ax}},$$

where $c_{ax} = |\{i \in I_a \cap I_x \mid v_{ai} \in C(v_{xi})\}|$ and $d_{ax} = |\{i \in I_a \cap I_x \mid v_{ai} \notin C(v_{xi})\}|$. Define the vector
$r_{a,v}$ as in (4.3), then we can rewrite $c_{ax}$ as

$$c_{ax} = \sum_{v} \sum_{w \in C(v)} r_{a,v}^T r_{x,w}$$

and $d_{ax}$ as

$$d_{ax} = r_a^T r_x - c_{ax}.$$

The active user calculates his vectors $r_a$ and $r_{a,v}$ for $v = 1, \ldots, 5$. These vectors are encrypted and
sent to the other users, as shown in Figure 4.4. The other users apply an inner product protocol to the
vectors. The encrypted inner products are sent to the server; the server calculates the encrypted sum
$\varepsilon(\sum_{v} \sum_{w \in C(v)} r_{a,v}^T r_{x,w})$ with the SUM protocol and sends this sum and the remaining inner product
$\varepsilon(r_a^T r_x)$ to the active user. The active user can calculate $c_{ax}$ and $d_{ax}$ and hence the majority voting
similarity.

Active user    Server    Other users

Figure 4.4: Protocol for the majority-voting similarity.
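The counts of agreeing and disagreeing votes can be obtained purely from inner products of indicator vectors, which is what makes the protocol above possible. The sketch below works on plaintext vectors and assumes, for illustration only, that the classes $C(v)$ group the votes as {1, 2}, {3} and {4, 5}; the names and votes are hypothetical, and 0 stands for "not rated".

```python
CLASSES = [{1, 2}, {3}, {4, 5}]    # assumed grouping of votes that count as agreeing


def vote_class(v):
    """C(v): the set of votes considered equal to v."""
    return next(c for c in CLASSES if v in c)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def indicator(votes, v):
    """Vector with a 1 for every item that received vote v."""
    return [1 if x == v else 0 for x in votes]


def agreement_counts(v_a, v_x):
    """Return (agreeing, disagreeing) counts using only inner products."""
    r_a = [1 if v else 0 for v in v_a]
    r_x = [1 if v else 0 for v in v_x]
    agree = sum(dot(indicator(v_a, v), indicator(v_x, w))
                for v in range(1, 6) for w in vote_class(v))
    disagree = dot(r_a, r_x) - agree
    return agree, disagree


v_a = [5, 1, 3, 0, 4]
v_x = [4, 3, 3, 2, 0]
counts = agreement_counts(v_a, v_x)
```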



4.2 Protocols for the predictors 

Standard predictor. Recall that the standard predictor is given by

$$p_{ai} = \bar{v}_a + \frac{\sum_{x \in U_i} s(a, x)(v_{xi} - \bar{v}_x)}{\sum_{x \in U_i} |s(a, x)|},$$

where the set $U_i$ could be restricted to the users with a similarity above a certain threshold. Such a
restricted set is called $U'_i$. Define the vector $s_a$ as

$$(s_a)_x = \begin{cases} s(a, x) & \text{if user } x \in U'_i, \\ 0 & \text{otherwise.} \end{cases}$$

Then the standard predictor is rewritten as

$$p_{ai} = \bar{v}_a + \frac{\sum_{x} s_{ax}\, w_{xi}}{\sum_{x} |s_{ax}|\, r_{xi}},$$

where $w_{xi}$ is defined as in (4.1). As the user knows the similarities, he can construct a vector $s_a$
and a vector $|s_a|$ with elements $|s_{ax}|$. The active user encrypts these vectors and sends them to the
server (Figure 4.5). The server sends the elements of the vectors to the appropriate users. The users
calculate the encrypted multiplications $\varepsilon(s_{ay} w_{yi})$ and $\varepsilon(|s_{ay}| r_{yi})$. They can add a random term via
(3.8) to protect their data even better. Subsequently, their messages are sent to the server. The server
calculates the encrypted sums and sends them to the user. The user can decrypt the messages and
obtain the prediction.

Active user    Server    Other users

Figure 4.5: Protocol for the standard predictor.

Simple predictor. The simple predictor can be secured in an identical way as the standard predictor.



Majority voting predictor. The active user encrypts the elements $s(a, x)$ of the vector $s_a$ and sends
them to the server, as shown in Figure 4.6. The server sends the elements of the vector to the
corresponding users. The other users calculate $\varepsilon(s(a, x)(r_{x,v})_i)$ for $v = 1, \ldots, 5$, where $r_{x,v}$ is
defined as in (4.3), and send it back to the server. The server collects the information and calculates
$\varepsilon(\sum_x s(a, x)(r_{x,v})_i)$ for $v = 1, \ldots, 5$. The user can decrypt these sums and calculate the prediction.

Active user    Server    Other users

Figure 4.6: Protocol for the majority voting predictor.

Chapter 5

Protocols for the Item-Based Algorithm

In this chapter we shall derive a protocol for the item-based algorithm. We shall use the adjusted
cosine (2.18) as similarity measure and the standard item-based predictor (2.19) as predictor. Other
similarities, as described in Chapter 4, can be obtained in a similar way.

5.1 Protocol for the adjusted cosine measure 

The adjusted cosine similarity is given by

$$s(i, j) = \frac{\sum_{x} w_{xi} w_{xj}}{\sqrt{\sum_{x} w_{xi}^2 r_{xj} \sum_{x} w_{xj}^2 r_{xi}}},$$

where $w$ is defined as in (4.1). Every user can calculate his own mean. The adjusted cosine consists of
three sums over the users, in which each user can calculate his own part. If a user did not rate both
items he can send a zero back. We can calculate the sums in two ways, the first of which is to use the
threshold cryptosystem like in Section 3.3. The users calculate their part of the sums and encrypt them
with the public key of the key-sharing scheme (Figure 5.1). The server collects the parts and computes
the encrypted sums by multiplying the different parts. Then he sends the encrypted sums back to the
users. The users can apply their secret share and send the result back to the server. If the server has
enough decrypted shares, he can decrypt the three sums.

Server    Users

Figure 5.1: Protocol for the adjusted cosine using key sharing.

Another possibility is to use two servers instead of one, as shown in Figure 5.2. In that case, the
parts of the users are encrypted with the public key of the decryption server. The encrypted parts are
sent to the recommender server, who calculates the sums and sends them to the decryption server. The
decryption server can decrypt the sums, and calculate the similarity between the items. This similarity
is sent back to the recommender server. In order to work well, the two servers should be independent,
as otherwise the messages of the users can be decrypted. The decryption server should be sure that
the values given to it are encrypted sums, and not the encrypted messages of the users.

Decryption server    Recommender server    Users

Figure 5.2: Protocol for the adjusted cosine with two servers.



5.2 Protocol for the predictor 

Recall that the item-based predictor is given by 

The item mean is a sum over the votes from all users divided by the number of users who gave their
opinion on the item. The similarities between the items are stored at the server, just as the item
means. We start with deriving a protocol for the calculation of the item means.

5.2.1 Protocol for the item mean

The item mean is a sum over the users, so for the protocol we can either use a threshold cryptosystem
(Figure 5.3) or a two-server system (Figure 5.4), just like in the protocol of the adjusted cosine. The
users encrypt their vote $v_{ui}$ and the indication $r_{ui}$ that they rated the item, with the public
key of the key-sharing scheme. The server collects these values and calculates the encrypted sums. These
sums are sent back to the users, who can apply their secret share. The decrypted shares are sent back to
the server. The server can decrypt the sum of the votes and the number of users that rated the item.
Hence, he has the mean of the item.



Figure 5.3: Protocol for the item mean using key sharing.
[Diagram: message exchange between Server and Users.]

The other option is to use two servers. In this case the users encrypt with the public key of the
decryption server, but send their encryptions to the recommender server. The recommender server
calculates the encrypted sums and sends them to the decryption server. The decryption server decrypts
the sums and calculates the item mean, which is sent to the recommender server.
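
The two-server variant can be illustrated with a short sketch. It assumes a toy Paillier key with tiny primes held by the decryption server; the class names `DecryptionServer` and `RecommenderServer` and their methods are hypothetical and only mirror the roles described above. A user who did not rate the item sends encryptions of zero for both the vote and the indicator.

```python
import math
import random


def encrypt(n, m):
    """Paillier encryption of integer m under public modulus n (g = n + 1)."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m, n * n) * pow(r, n, n * n) % (n * n)


class DecryptionServer:
    """Holds the private key and only ever sees aggregated ciphertexts."""

    def __init__(self, p=10007, q=10009):  # toy primes (assumption)
        self.n = p * q
        self._lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
        self._mu = pow(self._lam, -1, self.n)

    def _decrypt(self, c):
        n = self.n
        return (pow(c, self._lam, n * n) - 1) // n * self._mu % n

    def item_mean(self, enc_vote_sum, enc_count):
        count = self._decrypt(enc_count)
        return self._decrypt(enc_vote_sum) / count if count else None


class RecommenderServer:
    """Collects ciphertexts and multiplies them; holds no private key."""

    def __init__(self, n):
        self.n2 = n * n
        self.enc_vote_sum = 1  # 1 is a valid encryption of 0
        self.enc_count = 1

    def receive(self, enc_vote, enc_rated):
        self.enc_vote_sum = self.enc_vote_sum * enc_vote % self.n2
        self.enc_count = self.enc_count * enc_rated % self.n2


decryption_server = DecryptionServer()
recommender = RecommenderServer(decryption_server.n)
votes = [4, None, 5, 2, None]  # None: this user did not rate the item
for v in votes:
    rated = 0 if v is None else 1
    recommender.receive(encrypt(decryption_server.n, v or 0),
                        encrypt(decryption_server.n, rated))
mean = decryption_server.item_mean(recommender.enc_vote_sum,
                                   recommender.enc_count)
```

Neither server alone sees an individual vote: the recommender server handles only ciphertexts, and the decryption server receives only the two aggregated sums.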
Figure 5.4: Protocol for the item mean with two servers.
[Diagram: message exchange between Decryption server, Recommender server and Users.]



5.2.2 Protocol for the standard item-based predictor
We can rewrite into 

The server has knowledge of the similarities between the items, so it can calculate the encrypted
normalization constant $e(k) = e(\sum_{j} r_{aj} |s_{ij}|)$ and send it to the user, see Figure 5.5. The
server also calculates another encrypted constant $e(M) = e(k \bar{v}_i - \sum_{j} r_{aj} s_{ij} \bar{v}_j)$.
The user encrypts his vector with votes and sends it to the server. The server calculates the inner
product with the similarity vector $s_i$. The constant $M$ is added. The server sends the sum back to the
user. The prediction is this sum divided by the normalization constant $k$.
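
A hedged sketch of this exchange follows. It rests on several assumptions that are not confirmed by the text: the active user owns a toy Paillier key pair (tiny primes), the normalization constant is k = sum_j r_aj |s_ij|, and the constant is M = k * mean_i - sum_j r_aj * s_ij * mean_j, which is one reading consistent with the usual item-based predictor. The names `times`, `server_predict` and the scale `S` are hypothetical.

```python
import math
import random

P, Q = 10007, 10009            # toy primes (assumption)
N, N2 = P * Q, (P * Q) ** 2
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)
S = 100                        # fixed-point scale for similarities and means


def encrypt(m):
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return pow(N + 1, m % N, N2) * pow(r, N, N2) % N2


def decrypt(c):
    m = (pow(c, LAM, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m


def times(c, k):
    """Multiply the hidden plaintext by a known integer k."""
    return pow(c, k % N, N2)


def server_predict(enc_votes, enc_rated, sims, means, mean_i):
    """Return (e(inner product + M), e(k)) without decrypting anything."""
    s = [round(S * x) for x in sims]
    mu = [round(S * x) for x in means]
    enc_k, enc_num = 1, 1
    for ev, er, sj, mj in zip(enc_votes, enc_rated, s, mu):
        enc_k = enc_k * times(er, abs(sj)) % N2
        enc_num = enc_num * times(ev, sj * S) % N2      # inner product term
        enc_num = enc_num * times(er, -sj * mj) % N2    # minus r * s * mean_j
    enc_num = enc_num * times(enc_k, round(S * mean_i)) % N2  # plus k * mean_i
    return enc_num, enc_k


# Active user: votes (0 where unrated) and rating indicators.
votes, rated = [4, 0, 5], [1, 0, 1]
enc_num, enc_k = server_predict([encrypt(v) for v in votes],
                                [encrypt(r) for r in rated],
                                sims=[0.5, -0.25, 0.8],
                                means=[3.2, 2.5, 4.1], mean_i=3.0)
prediction = decrypt(enc_num) / (decrypt(enc_k) * S)
```

The numerator carries the scale `S` twice and the normalization constant once, which is why the user divides by `decrypt(enc_k) * S` after decryption.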

Figure 5.5: Protocol for the item-based prediction.
[Diagram: message exchange between Active user and Server.]
Chapter 6

Protocols for the Factor-Analysis Algorithm



A protocol for factor analysis must protect the user profiles and the model $\Lambda$, as it contains
information about correlations between the items. As the user needs to use the matrix in an unencrypted
way, we decided to use a so-called personal server.

6.1 The personal server

The personal server is a piece of hardware and/or software installed in the user's device. It could for
instance be installed in the internet radio player Streamium, which is developed at Philips Research.
The personal server has the following properties.

• The personal server may know the information of its user.

• The personal server may not send information directly to the server, but only through the user.

• The personal server may know information about the central server.

• Only the central server decides which information on the personal server is given to the user.



Figure 6.1: The personal server with un-encrypted information streams.
[Diagram nodes: Profile, User, Personal Server, Central Server.]

The factor-analysis algorithm has information streams as depicted in Figure 6.1. The server sends
the model $\Lambda$, $\psi$ to the personal server, and the user sends his profile to the personal
server. With this information the personal server can calculate new predictions for the user and update
information for the server. The update information $A_u$, $B_u$ and $C_u$, as given in (zm), is sent via
the user to the server. The server can compute $\sum_u A_u$, $\sum_u B_u$ and $\sum_u C_u$, which are used
for the update of the model $\Lambda$, $\psi$. The central server decides how many recommendations a user
receives. The security protocol for factor analysis can be split into two parts. The first part is the
protocol for the model and the second part is the protocol for the user profile.
6.2 Protocol for the model

To avoid a lot of computations, we assume that the personal servers of all users have the same public
key and private key. Therefore, model $\Lambda$ and variance $\psi$ are encrypted only once per iteration
with the public key of the personal server (Figure 6.2). The personal servers are the only ones who can
decrypt the model $\Lambda$ and variance $\psi$. The user of the personal server does not own the private
key, so he cannot decrypt model $\Lambda$ and variance $\psi$.

6.3 Protocol for the profiles 

The protocol for the profiles makes use of the threshold Paillier system described in Section 3.3. The
users send the encrypted update information $A_u$, $B_u$ and $C_u$ to the server, as shown in Figure 6.2.
The server can calculate the encrypted sums $e(\sum_u A_u)$, $e(\sum_u B_u)$ and $e(\sum_u C_u)$ by
multiplying the incoming information. These encrypted sums are sent back to the users. The users
calculate the decrypted shares $d'(e(\sum_u A_u))$, $d'(e(\sum_u B_u))$ and $d'(e(\sum_u C_u))$. The
decrypted shares are sent back to the server. If the server has enough decrypted shares available, he
can decrypt the messages and obtain $\sum_u A_u$, $\sum_u B_u$ and $\sum_u C_u$, which he uses next to
update the model.

Figure 6.2: Protocol for factor analysis.
[Diagram: message exchange involving the Users.]

In a key-sharing scheme, the running time increases when the number of users increases
(Chapter 8). Therefore we split big groups of users randomly into smaller groups, thus reducing the
running time of the algorithm. With the protocol mentioned above, the server can calculate the sums
per group. The total of these sums is exactly the information the server needs. Instead of performing a
key-sharing protocol, we could also use a two-server protocol, as shown in Figure 6.3. Then we encrypt
the update information of the users with the public key of the decryption server, and send it to the
recommender server. The recommender server can multiply the incoming information and sends the
resulting encrypted sum to the decryption server. The decryption server has the secret key, so it can
decrypt the result, which is sent back to the recommender server.
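
The group-splitting idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: the update information is reduced to one hypothetical scalar per user, and `group_sum` is a plain-text placeholder for one run of the key-sharing protocol within a group. The function names are not taken from the text.

```python
import random


def split_into_groups(user_ids, group_size, rng=random):
    """Randomly partition the users into groups of at most group_size."""
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    return [shuffled[k:k + group_size]
            for k in range(0, len(shuffled), group_size)]


def group_sum(group, updates):
    """Placeholder for one run of the key-sharing protocol within a group.

    In the protocol described above only this sum, not the individual
    updates, would become visible to the server.
    """
    return sum(updates[u] for u in group)


updates = {u: u * u for u in range(1, 11)}  # hypothetical scalar updates
groups = split_into_groups(updates, group_size=3)
total = sum(group_sum(g, updates) for g in groups)
```

Whatever the random partition, `total` equals the sum over all users, so the model update is unchanged while each key-sharing run involves only a small group.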



Figure 6.3: Protocol for factor analysis with two servers.
[Diagram: message exchange between Decryption server, Recommender server and Users.]
CLAIMS:

1. A method for content recommendation, comprising a step of collecting at a
central server encrypted rating vectors from at least two users, a step of collaborative filtering
using the encrypted rating vectors so as to protect the users' privacy, and a step of sending a
content recommendation to a user.

2. A method as claimed in claim 1, wherein the collaborative filtering uses at
least one of vector inner products and sums of shares.

3. A system for implementing the method of claims 1 to 2.

4. A computer program product for implementing the method of claims 1 to 2.
ABSTRACT:

The invention is to protect the users' privacy, given by their rating
information, by rewriting the computational steps required for the collaborative filtering
algorithm into vector inner products and sums of shares, after which we apply the mentioned
encryption techniques to protect them. In a sense, this means that only encrypted information
is sent to the central server, and all computations are done in the encrypted domain.